PREFACE


Dr. D. Basu’s pioneering contributions to statistics started at the 
beginning of the fifties. For about four decades, Dr. Basu, in many of his 
fundamental writings, has examined critically the foundations of statistical 
inference, concepts such as information, likelihood, invariance, ancillarity, 
randomization, fiducial probabilities, logical foundations of survey sampling, and 
many related concepts. His research has led to some path-breaking results such 
as independence of ancillary and boundedly complete sufficient statistics, 
characterization of sufficiency in finite population sampling, the design 
independence of Bayesian inference procedures in sample surveys, to name a few. 
His research has influenced several generations of statisticians, and will continue 
to do so for years to come. Most of Dr. Basu’s critical essays are now collected in 
a Springer volume entitled Statistical Information and Likelihood, thanks to the 
efforts of Professor J.K. Ghosh. 

Professor Basu was born on July 5, 1924, in Dacca, now in Bangladesh. 
He received a Master’s Degree in Mathematics from Dacca University around 
1945, and taught there briefly from 1947 to 1948. He moved to Calcutta in 1948, 
where he worked as an actuary with an insurance company for some time. In 
1950, he joined the Indian Statistical Institute as a research scholar under 
Professor C.R. Rao. In 1953, he submitted his Ph.D. thesis to the Calcutta 
University and went to Berkeley as a Fulbright scholar. His associations with 
Neyman at Berkeley and with Fisher at the Indian Statistical Institute in 1955 
gave him a deep insight into both the Neyman-Pearson theory as well as the 
Fisherian theory of ancillarity and conditionality. He knew and understood these 
paradigms better than most of his contemporaries. His critical examination of 
both the Neyman-Pearsonian and the Fisherian modes of inference eventually 
forced him to a Bayesian point of view, via the likelihood route. The final 
conversion to Bayesianism came in January, 1968, when Basu was invited to 
speak at a Bayesian Session in the Statistics Section of the Indian Science 
Congress. He confesses that, while preparing for these lectures, he became 
convinced that Bayesian inference did indeed provide one with a logical resolution 
of the underlying inconsistencies of both the Neyman-Pearson and the Fisherian 
theories. Since then, Dr. Basu became an ardent Bayesian and, in many of his 
foundation papers, pointed out the deficiencies of both the Neyman-Pearsonian 
and the Fisherian methods. 

Professor Basu was on the Faculty of the Indian Statistical Institute for 
many years. His passion for travel has taken him to universities all over the 
world as a visitor, e.g. UNC at Chapel Hill, University of Chicago, University of 
New Mexico, University of Sheffield, University of Adelaide, Iowa State 
University, to name a few. He was a Professor of Statistics at Florida State
University from 1976 until his retirement in 1986. Throughout his professional
career, he has maintained strong ties with the Indian Statistical Institute. Now 
in his retirement, when he is not abroad, he loves to return to the ISI to look 
around the classrooms, the flower-beds, and the rose gardens which he so 
painstakingly helped create during his association with the Institute.

In his fruitful research career spanning nearly four decades, Dr. Basu’s 
emphasis has always been on the foundations and the underlying concepts rather 
than on the technicalities. In keeping with his philosophy, essays in this 
festschrift volume, dedicated to Dr. Basu on the occasion of his 65th birthday, 
place the major emphasis on the foundational issues of statistical inference. Most 
of the papers in this volume are review articles written by his friends and 
colleagues in those areas of statistics that have interested Dr. Basu most during 
his active research career. This monograph differs from other festschrift volumes 
in yet another respect. It is written in a narrative style which has typified so 
much of Dr. Basu’s own writings in statistics. We believe that this is a fitting 
tribute to a scientist whose simplicity of exposition has earned him a special place 
in the evolution of contemporary statistics. 

We take this opportunity to thank all the authors of this volume who 
spent so much time writing and rewriting their articles. We would also like to 
thank the referees (names arranged alphabetically): J. Berger, A. Bose, G. 
Casella, R. Christensen, L. Kuo, D. Lane, G. Meeden, R.V. Ramamoorthi, B.K. 
Sinha, J. Srivastava, and W.J. Zimmer for their selfless service. Special thanks 
are due to Professor Robert J. Serfling, Editor of the IMS Lecture Notes 
Monograph Series for agreeing to publish this collection of essays. The project 
would never have been completed without his active encouragement at different 
stages of its preparation. We also thank Jose L. Gonzalez, the IMS Business 
Manager for his valuable advice at the final stages of the preparation of this 
volume. 

Finally, we wish to thank Ms. Cindy Zimmerman for her patient and 
careful typing of all the manuscripts in a unified format. 


Malay Ghosh Pramod K. Pathak 
University of Florida, University of New Mexico, 
Gainesville Albuquerque 
CONDITIONAL INFERENCE FROM CONFIDENCE SETS


George Casella, Cornell University 


Abstract 


Ideas of inference using conditional confidence have grown out of many 
different schools of statistical thought. The development of these ideas is traced, 
starting with some original ideas of Fisher. The influence of other researchers, 
such as Basu and Buehler, is also discussed. The development is traced to the 
present, through the work of Pierce and Robinson, to current work in conditional 
inference. 


Introduction 


The development of conditional inference, in particular that based on 
confidence sets, has followed many paths. There are now several inferential 
methods that use this name. For example, the likelihood based methods of 
Hinkley (1980), or Cox and Reid (1987), are conditional inference methods. The 
attempt of Kiefer (1977) to merge conditional ideas with frequentist theory is
also conditional inference. 

The one common factor in the different conditional inferences is the 
requirement of reasonable (coherent) post-data inference. That is, inferential 
statements made after the data have been seen should have some logical 
consistency. Another approach to conditional inference, one that gained structure 
through the work of Buehler (1959) and Robinson (1979a,b), provides an 
objective framework for assessing post-data validity. It is this version of 
conditional inference, based on confidence sets, on which we will concentrate. 

The different versions of conditional inference have a common origin in 
ideas of Fisher. These ideas of Fisher are somewhat intuitive, and leave some 
gaps in development (but not to Fisher!). The origins in Fisher were later refined 
by Basu, who relied on ideas of Bayesian inference to close any gaps. This is
where our review begins.


1. The seeds of conditional inference 


Many influential ideas in statistics can be attributed to Sir Ronald 
Fisher. One of the most elusive, perhaps, is that of conditional inference. In 
Fisher (1959, page 78) we find the ideas of a reference set: 


This paper was written in honor of Professor D. Basu on the occasion of his 65th birthday. 
In attempting to identify a test of significance --- with a test for
acceptance, one of the deepest dissimilarities lies in the population, 
or reference set, available for making statements of probability. 


Interpreting Fisher, we find that he is concerned with the range of the 
inferences, that is, with the set in the population to which the inference should 
apply. In this sense, he is concerned with conditional inference, inference 
conditional on some subset of the sample space. The exact nature of his concern 
is not, at first, clear. It does emerge in some later statements, again from Fisher 
(1959, page 81). In talking of inference from Student’s t distribution, he says 


The reference set for which this probability statement holds is that 
of the values of $\mu$, $\bar{x}$ and $s$ corresponding to the same sample ---


there is no possibility of recognizing any subset of cases --- for 
which any different value of the probability should hold. (my 
italics) 


In this statement we see one of the keystones of conditional inference. 
There should not be a subset of the sample space (a recognizable subset) on 
which the inference from a procedure can be substantially altered. If such subsets 
exist, then inference from the procedure is suspect. 

If such a recognizable subset existed, Fisher would no doubt find it.
However, there does not seem to be any general methodology in use. Although
ideas of estimating and eliminating nuisance parameters, and also ideas of
ancillarity, are used, no general scheme is defined.

One famous example is Fisher’s criticism of Welch’s solution to the 
Behrens-Fisher problem. If $\bar{x}_i$, $s_i^2$, $i = 1,2$, are the sample mean and variance
from samples of size $n$ from independent normal populations with unknown
parameters $\mu_i$ and $\sigma_i^2$, Fisher (1956) derived the following fact. Under the
hypothesis $H_0\colon \mu_1 = \mu_2$, for any value $t$,


$$P\left( \frac{\sqrt{n}\,|\bar{x}_1 - \bar{x}_2|}{\sqrt{s_1^2 + s_2^2}} > t \;\middle|\; \frac{s_1^2}{s_2^2} = 1 \right) = P\left( |T_{2(n-1)}| > \tau t \right), \qquad (1)$$


where $T_{2(n-1)}$ has Student’s t distribution with $2(n-1)$ degrees of freedom, and
$\tau$ is an unknown parameter satisfying $0 < \tau \le 1$. Thus, conditional on
$s_1^2/s_2^2 = 1$, the random variable $\sqrt{n}\,|\bar{x}_1 - \bar{x}_2| / \sqrt{s_1^2 + s_2^2}$ is
stochastically greater than $|T_{2(n-1)}|$.

Fisher used this fact to show that Welch’s solution suffered from the property
that the probability of rejecting a true $H_0$, given that $s_1^2/s_2^2 = 1$, was bounded
below by the nominal level. Thus, on the recognizable subset $\{(s_1^2, s_2^2)\colon s_1^2 = s_2^2\}$,
Welch’s solution has an actual error rate greater than the nominal level.

This conditional behavior would be even more disturbing if the set 
$\{(s_1^2, s_2^2)\colon s_1^2 = s_2^2\}$ is taken as a reference set, i.e., a set on which the conditional
inference should be applied. Fisher’s argument for conditioning on this set, or
more generally on the ratio $s_1^2/s_2^2$, is elusive. The fact that Fisher considers this a
reasonable reference set appears again in Fisher (1959), where he discusses his 
solution to the Behrens-Fisher problem. 

The fact remains, however, that the mechanism of choice of a reference 
set is elusive. Although concepts of ancillarity and elimination of nuisance para- 
meters are considered, a general mechanism for choosing a conditional reference 
set is not known. 


2. Basu’s refinement 


In doing conditional, or post-data, inference the evidential meaning of 
the inference becomes increasingly important. Fisher’s idea of a reference set has 
some meaning, i.e., it defines a part of the sample space on which inference is to 
be restricted. On the other hand, the connotation of a recognizable set does not 
carry this distinction. 

A recognizable set is only a set that is in the sample space, and may give 
no meaningful inference base. Poor conditional (post-data) performance of a 
procedure on a recognizable set is taken as criticism, but if this recognizable set is 
not a meaningful reference set, then the criticism may be vacuous. 

Fisher had the intuition to choose recognizable subsets that were also 
meaningful reference sets. Thus, when he leveled criticism (or praise) of the 
conditional performance of a procedure using a particular recognizable set, this 
set was also a meaningful reference set. One of the major clues left to us by 
Fisher, on how to choose these reference sets, is that they should use ancillary
information. 

Alas, many of us are not possessed with Fisher’s intuition in choosing 
reference sets. When Basu started to think about this, he realized that basing 
conditioning sets on ancillary information was not, in itself, a reasonable 
technique in general. In Basu (1964, page 17, Statistical Information), he says 


The ancillary argument of Fisher cannot be extended ---. We end 
this discourse with an example where --- the ancillary argument 
leads us to a rather curious and totally unacceptable ‘reference 
set’. 


Basu then gives an example to illustrate his point. The point that we 
should be concerned with is that the choice of the reference set is not automatic. 
Of course, Basu does not give us a recipe for choosing a reference set, but rather 
argues that the only reasonable procedures are free of conditional defects. 


3. Conditional and unconditional inference 


Inference made conditional on the data must, necessarily, connect a 
statement about the unknown parameters to the data actually observed. This 
fact separates conditional confidence inference from unconditional, or pre-data, 
confidence inference. This latter inference, that of the frequentist (Neyman-
Pearson) school, need not apply in any way to the data at hand. A frequentist
inference merely states how the procedure will perform in repeated trials, even if 
such a statement is ludicrous in the face of the observed data. 

This dichotomy, between conditional and unconditional inference, most 
often results in a statistician choosing one stand and rejecting the other. Fisher 
rejected unconditional inference in favor of conditional. Basu, although starting 
in the Neyman-Pearson camp, ultimately rejected unconditional inference in favor 
of Bayesian conditional inference. Indeed, perhaps Basu stated his belief most 
elegantly in Basu (1981, page 173, Statistical Information) 


With $E_x$ as the (Neyman-Pearson) confidence set corresponding to
the observed sample $x$, can any evidential meaning be attached to
the assertion $\theta \in E_x$? Suppose on the basis of sample $X$ one can
construct a 95% confidence interval estimator for the parameter $\theta$,
then does it mean that (the random variable) $X$ has information
on $\theta$ in some sense?


Of course, Basu gave examples of 95% Neyman-Pearson confidence
intervals with no information at all about $\theta$. For example, if $\theta \in [0,1]$, and
$X \sim U(0,1)$ ($X$ is $\theta$-free), then for any fixed set $B \subset (0,1)$, the set


$$E_x = \begin{cases} B & \text{if } 0 < X < .05 \\ (0,1) & \text{if } .05 < X < .95 \\ B^c & \text{if } .95 < X < 1 \end{cases}$$


is a 95% unconditional confidence set for $\theta$. But, of course, we cannot attach any
evidential meaning to the statement “$\theta \in E_x$.” (We note, in passing,
that the conditional behavior of this set is wretched. For example,
$P(\theta \in E_x \mid 0 < X < .05) = P(\theta \in B)$ and
$P(\theta \in E_x \mid .95 < X < 1) = P(\theta \in B^c)$. One of these two probabilities must be
smaller than .95. Further, $P(\theta \in E_x \mid .05 < X < .95) = 1$, showing that the
post-data inference can be moved all over.)
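
The behavior of this example can be checked numerically. The following is a minimal simulation sketch, not part of Basu's argument: the choice B = (0, 0.5), the value theta = 0.3, and the function names are illustrative assumptions.

```python
import random


def covers(theta, x, b_lo=0.0, b_hi=0.5):
    """Does the set E_x contain theta? B = (b_lo, b_hi) is a hypothetical choice."""
    in_b = b_lo < theta < b_hi
    if x < 0.05:
        return in_b        # E_x = B
    if x > 0.95:
        return not in_b    # E_x = complement of B
    return True            # E_x = (0, 1)


def coverage_summary(theta, n_sims=200_000, seed=0):
    """Estimate overall coverage and coverage within each region of X."""
    rng = random.Random(seed)
    hits = {"all": 0, "low": 0, "mid": 0, "high": 0}
    counts = dict(hits)
    for _ in range(n_sims):
        x = rng.random()  # X ~ U(0,1); its distribution is free of theta
        region = "low" if x < 0.05 else "high" if x > 0.95 else "mid"
        hit = covers(theta, x)
        for key in ("all", region):
            counts[key] += 1
            hits[key] += hit
    return {key: hits[key] / counts[key] for key in hits}


summary = coverage_summary(theta=0.3)
print(summary)
```

With theta inside B, the overall rate should be close to 0.95, while the rates within the low, middle and high regions of X are 1, 1 and 0: the same set, three very different post-data assessments.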

As we trace the development of conditional inference, we will see that 
Basu’s teachings are there. Many papers take the approach of verifying good 
conditional properties by verifying Bayesianity. However, this might be a case 
where some good can come out of greed. Why should we be satisfied with only 
good post-data behavior or good pre-data behavior? Why can’t we try for both? 
The answer is that we can not only try for both, we can sometimes attain it. 
The procedures that do can be acclaimed by both camps — conditional and 
unconditional. 
Formalizing Conditional Inference


The work of Buehler (1959) was a landmark attempt in examining post- 
data validity of Neyman-Pearson procedures. Buehler’s work is pioneering for 
two reasons. One, he examined post-data behavior of frequency based rules (not 
necessarily Bayes rules) and two, he developed criteria for carrying out this 
evaluation in an objective manner. Buehler’s work was based on other seminal 
ideas of Tukey (1958) and Stein (1961), and was ultimately generalized and 
formalized by Robinson (1979a,b). We briefly describe Robinson’s set-up. 

The random variable $X$ has density $f(x \mid \theta)$ and, based on observing
$X = x$, a confidence procedure $\langle C(x), \gamma(x) \rangle$ is constructed. A confidence
procedure consists of a set $C(x)$ and a probability assertion $\gamma(x)$. The validity of
$\gamma(x)$ as a confidence assertion is measured by the ability of $\langle C(x), \gamma(x) \rangle$ to
maintain its confidence even when evaluated conditionally. To be specific, we
consider $\gamma(x)$ to be an evaluation of the coverage properties of $C(x)$ in the sense
that


$$E_\theta \gamma(X) \approx P_\theta(\theta \in C(X)). \qquad (2)$$


Suppose now that a recognizable subset, $A$, of the sample space, and an $\epsilon > 0$
exist such that


$$E_\theta(\gamma(X) \mid X \in A) - P_\theta(\theta \in C(X) \mid X \in A) \ge \epsilon \quad \forall\, \theta. \qquad (3)$$


Then, we have qualitatively changed the confidence behavior. On the set $A$, our
conditional assertion is suspect: The asserted probability, $\gamma(x)$, is, on the
average, uniformly greater than the actual conditional coverage.

In Robinson’s terminology, (3) is a special case of a relevant betting 
function, defined as follows: 

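Definition 1: A function $k(x)$, $-1 \le k(x) \le 1$, is relevant for
$\langle C(x), \gamma(x) \rangle$ if


$$E_\theta\{[I(\theta \in C(X)) - \gamma(X)]\, k(X)\} \ge \epsilon\, E_\theta|k(X)| \qquad (4)$$


for all $\theta$ and some $\epsilon > 0$. If $\epsilon = 0$, $k(x)$ is semirelevant.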

For statistical purposes, the most interesting forms of functions $k(x)$ are
indicator functions. Such functions reduce (4) to forms like (3), and allow
interpretations in terms of conditional coverage probabilities. If $k(x) \le 0$ is
relevant, it is called negatively biased. If $k(x) = -I(x \in A)$ then (4) would reduce
to (3). Positively-biased sets can similarly be defined. In the previously
mentioned criticism by Fisher of Welch’s solution to the Behrens-Fisher problem,
Fisher identified a negatively-biased relevant subset.
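
To make the reduction from (4) to (3) explicit, here is a short worked step. It assumes that $P_\theta(X \in A) > 0$ for every $\theta$, a condition the text leaves implicit.

```latex
% Take k(x) = -I(x \in A), so that |k(x)| = I(x \in A).
\begin{align*}
E_\theta\{[I(\theta \in C(X)) - \gamma(X)]\,k(X)\}
  &= E_\theta\{[\gamma(X) - I(\theta \in C(X))]\,I(X \in A)\}
   \;\ge\; \epsilon\,P_\theta(X \in A) \\
\Longleftrightarrow\quad
E_\theta(\gamma(X) \mid X \in A) - P_\theta(\theta \in C(X) \mid X \in A)
  &\ge \epsilon \qquad \text{for all } \theta .
\end{align*}
```

The second line follows on dividing both sides by $P_\theta(X \in A)$, which turns the expectations over $A$ into conditional expectations given $X \in A$.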

Buehler and Fedderson (1963) identify, in a special case, a positively-biased
relevant subset for the one-sample t interval (they also attribute a similar
result to Stein, 1961). Later, Brown (1967) generalized this result to any
one-sample t interval. For a random sample $X_1, \ldots, X_n$ from $n(\mu, \sigma^2)$, Brown
identified constants $K$ and $\epsilon$ so that


$$P(\mu \in \bar{X} \pm tS \mid |\bar{X}|/S < K) \ge 1 - \alpha + \epsilon \quad \forall\, \mu, \sigma^2, \qquad (5)$$


where $t$ is the cutoff yielding a nominal $1-\alpha$ interval. This can be interpreted as
saying that the conditional coverage of the t interval, after accepting $H_0\colon \mu = 0$, is
uniformly greater than the nominal level.
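
A small Monte Carlo sketch can illustrate the kind of statement made in (5). It is an illustration under stated assumptions, not Brown's construction: it takes n = 2 and a nominal 50% interval, chosen only because the t cutoff with one degree of freedom is then exactly 1; it writes S for the standard error |x1 - x2|/2; and it uses a hypothetical constant K = 3 that is not taken from any of the cited papers.

```python
import random


def t_interval_coverage(mu, sigma=1.0, k=3.0, n_sims=200_000, seed=1):
    """Estimate overall and conditional coverage of the n = 2, 50% t interval.

    The conditioning event is |xbar| / S < k, with S = |x1 - x2| / 2 the
    standard error; k = 3 is a hypothetical constant chosen for illustration.
    """
    rng = random.Random(seed)
    covered = in_set = covered_in_set = 0
    for _ in range(n_sims):
        x1 = rng.gauss(mu, sigma)
        x2 = rng.gauss(mu, sigma)
        xbar = (x1 + x2) / 2.0
        se = abs(x1 - x2) / 2.0
        hit = abs(xbar - mu) < 1.0 * se  # t cutoff is 1 for 1 df at the 50% level
        covered += hit
        if abs(xbar) < k * se:           # the "accept H0: mu = 0" region
            in_set += 1
            covered_in_set += hit
    return covered / n_sims, covered_in_set / in_set


for mu in (0.0, 1.0, 2.0):
    overall, conditional = t_interval_coverage(mu)
    print(f"mu={mu}: overall={overall:.3f}, conditional={conditional:.3f}")
```

For these values of mu the overall rate should stay near the nominal 0.5 while the conditional rate comes out higher. A simulation at a few parameter values cannot establish the uniform bound in (5); it only shows the direction of the effect.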

Identification of semirelevant subsets is less interesting than identification 
of relevant subsets, as most procedures with a frequentist guarantee will allow 
them. For example, from (5) we can deduce 


$$P(\mu \in \bar{X} \pm tS \mid |\bar{X}|/S > K) < 1 - \alpha \quad \forall\, \mu, \sigma^2, \qquad (6)$$


identifying a negatively-biased semirelevant set for the t interval. However, 
Robinson (1976) showed that the t interval allows no negatively biased relevant
sets. This led him to conclude that elimination of negatively-biased semirelevant 
sets was too strong a conditional criterion, but elimination of negatively-biased 
relevant sets was about right. (The elimination of positively biased sets is of 
lesser concern, as this corresponds to being conservative. However, note there are 
situations when this direction of error can be important.) 

An interesting set of papers are those by Olshen (1973), and Scheffe 
(1977) with a rejoinder by Olshen (1977). In the 1973 paper, Olshen established 
a result like (6) for the Scheffe multiple comparisons procedure. Specifically, 
Olshen showed that the conditional coverage of the Scheffe procedure, given that 
the ANOVA F test rejects $H_0$, is less than or equal to the nominal level. Thus,
Olshen generalized Brown (1967) in one direction, identifying a negatively biased 
semirelevant set for the Scheffe intervals. Scheffe took exception to this criticism, 
and answered Olshen in the 1977 article. 

The connection between Bayes sets and conditional performance is very 
strong, as shown by Pierce (1973) and Robinson (1979a). If $\pi(\theta)$ is a proper
prior, and we define the pair $\langle C^\pi(x), \gamma^\pi(x) \rangle$ by


$$\gamma^\pi(x) = \int_{C^\pi(x)} \pi(\theta \mid x)\, d\theta, \qquad (7)$$


where $\pi(\theta \mid x) = f(x \mid \theta)\pi(\theta) / \int f(x \mid \theta)\pi(\theta)\, d\theta$, then no semirelevant
functions exist for $\langle C^\pi(x), \gamma^\pi(x) \rangle$. Thus, proper Bayes procedures have the
strongest possible conditional properties.
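
One way to see why a proper prior rules out such functions is to average the left side of (4) over the prior. The following is a sketch of that step, assuming the interchange of integrals is justified and writing $m(x)$ for the marginal density of $X$ (a symbol not used in the text).

```latex
\begin{align*}
\int \pi(\theta)\,E_\theta\{[I(\theta \in C^\pi(X)) - \gamma^\pi(X)]\,k(X)\}\,d\theta
  &= \int k(x)\,m(x)\left[\int_{C^\pi(x)} \pi(\theta \mid x)\,d\theta
     - \gamma^\pi(x)\right] dx \\
  &= 0, \qquad m(x) = \int f(x \mid \theta)\,\pi(\theta)\,d\theta .
\end{align*}
```

Since the prior average is zero for every bounded $k$, the expectation in (4) cannot be nonnegative for all $\theta$ and also strictly positive on a set of $\theta$ with positive prior probability.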

Although the connection between Bayesianity and conditional perform- 
ance is very strong, the exact link has not yet been established. That is, 
necessary and sufficient conditions for elimination of relevant, or semirelevant, 
functions have not yet been established. Although the work of Pierce and 
Robinson, and also Bondar (1977), establishes links between (possibly improper) 
Bayes procedures and nonexistence of relevant sets, the ultimate theorem, giving
a necessary and sufficient condition, is still not known. The answer, although 
still unproven due to mathematical technicalities, seems to be that elimination of 
relevant functions will occur if and only if the procedure is a limit of Bayes rules. 
Another step in establishing this connection was taken by Casella and Robert 
(1988), but the full answer remains an open question in the conditional inference 
literature. 


Frequentist Conditional Inference 


Although proper Bayes rules have strong conditional properties they do 
not, in general, have good frequentist properties. Even Bayes rules based on “flat 
priors”, such as a Cauchy, which may exhibit some acceptable frequentist per- 
formance, cannot maintain a frequentist confidence guarantee. This is a property 
shared by Bayes credible sets based on proper prior distributions (Hwang and 
Casella, 1988). However, limits of Bayes rules, or generalized Bayes rules, can 
maintain a frequentist guarantee, and such procedures may also have acceptable 
conditional properties. It is within this class that we can find procedures that 
have acceptable frequentist (or pre-data) properties and acceptable conditional 
(or post-data) properties. 

A confidence set, $C(x)$, is a $1-\alpha$ frequentist confidence procedure for a
parameter $\theta$ if


$$P_\theta(\theta \in C(X)) \ge 1 - \alpha \quad \text{for all } \theta, \qquad (8)$$


that is, the unconditional coverage probability of $C(x)$ is at least $1-\alpha$. Of course,
this pre-data guarantee says nothing of the conditional performance of the
procedure $\langle C(x), 1-\alpha \rangle$. Robinson was able to establish conditional properties
for several frequentist procedures by using the fact that they are limits of Bayes
rules. In particular, his results for the t interval (Robinson, 1976) rely on this
fact. Other results (Robinson, 1979b) for frequentist intervals for location or 
scale families also use arguments based on limiting Bayesianity. Most
conditional properties of limits of Bayes rules deal with relevant, rather than
semirelevant, functions, and the existence of $\epsilon > 0$ becomes important in the
limit. However, for certain procedures from location families, Robinson (1979b)
established the nonexistence of semirelevant functions. In particular, if
$X \sim f(x - \theta)$, then the procedure


$$\langle [x-c, x+c],\ 1-\alpha \rangle, \qquad 1 - \alpha = \int_{-c}^{c} f(t)\, dt, \qquad (9)$$


is a $1-\alpha$ frequentist confidence procedure that allows no semirelevant functions.
Using different arguments based on invariance, Bondar (1977) established
conditional properties of invariant frequentist sets. 

The issue that is at the heart of the frequentist/conditional dichotomy is 
the assignment of a confidence function to a set $C(x)$. For example, for any set
$C(x)$, where $X \sim f(x \mid \theta)$, if we define $\gamma(x)$ by


$$\gamma(x) = \frac{\int_{C(x)} f(x \mid \theta)\pi(\theta)\, d\theta}{\int_{\Theta} f(x \mid \theta)\pi(\theta)\, d\theta}, \qquad (10)$$


where $\pi(\theta)$ is a proper prior, then the procedure $\langle C(x), \gamma(x) \rangle$ is free of
semirelevant sets. However, if $C(x)$ is also a $1-\alpha$ frequentist confidence
procedure, this argument does not imply any conditional properties of
$\langle C(x), 1-\alpha \rangle$. Thus, this type of consideration leads to two questions:


i) Is $\langle C(x), \gamma(x) \rangle$ a reasonable frequentist procedure?

ii) Is $\langle C(x), 1-\alpha \rangle$ a reasonable conditional procedure?          (11)


Since the work of Robinson, and the others, in the 1970s there has been 
some progress made on the questions in (11). In Casella (1987) it was argued 
that, with some regularity conditions, a sufficient condition for the frequentist 
procedure $\langle C(x), 1-\alpha \rangle$ to be conditionally acceptable is the existence of a
(possibly improper) prior $\pi(\theta)$ such that


$$\gamma(x) = \frac{\int_{C(x)} f(x \mid \theta)\pi(\theta)\, d\theta}{\int_{\Theta} f(x \mid \theta)\pi(\theta)\, d\theta} \ge 1 - \alpha \quad \text{for all } x. \qquad (12)$$


If (12) is satisfied, then the procedure $\langle C(x), 1-\alpha \rangle$ allows no negatively biased
relevant sets, which is acceptable conditional performance. Furthermore, it was
demonstrated that such a property held for the multivariate normal confidence
set centered at the positive-part James-Stein estimator. Specifically, if
$X \sim N(\theta, I)$, a $p$-variate normal random variable ($p \ge 3$), then the confidence
procedure $\langle C_\delta(x), 1-\alpha \rangle$ allows no negatively-biased relevant sets, where


$$C_\delta(x) = \{\theta : |\theta - \delta(x)| \le c\}, \qquad \delta(x) = \left(1 - \frac{a}{|x|^2}\right)^{+} x, \qquad P(\chi_p^2 \le c^2) = 1 - \alpha.$$
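
As a numerical companion to this example, the sketch below estimates the unconditional coverage of a sphere of fixed radius centered at a positive-part James-Stein estimate, alongside the usual sphere centered at x. Several choices are assumptions made for illustration: p = 4 (so that the chi-square distribution function has a closed form), the shrinkage constant a = 0.8(p - 2), the level 0.90, and all function names. The simulation addresses only pre-data coverage; it does not check the relevant-set property itself.

```python
import math
import random


def chi2_cdf_4df(x):
    """Closed-form CDF of a chi-square variable with 4 degrees of freedom."""
    return 1.0 - math.exp(-x / 2.0) * (1.0 + x / 2.0)


def chi2_quantile_4df(prob):
    """Invert the CDF by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf_4df(mid) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def positive_part_js(x, a):
    """Positive-part James-Stein estimate (1 - a/|x|^2)^+ x."""
    norm2 = sum(v * v for v in x)
    shrink = max(0.0, 1.0 - a / norm2)
    return [shrink * v for v in x]


def coverage(theta, a, alpha=0.10, n_sims=50_000, seed=2):
    """Monte Carlo coverage of the usual and recentered sets; theta has length 4."""
    rng = random.Random(seed)
    c2 = chi2_quantile_4df(1.0 - alpha)  # squared radius of both spheres
    usual = recentered = 0
    for _ in range(n_sims):
        x = [rng.gauss(t, 1.0) for t in theta]
        d = positive_part_js(x, a)
        usual += sum((xi - t) ** 2 for xi, t in zip(x, theta)) <= c2
        recentered += sum((di - t) ** 2 for di, t in zip(d, theta)) <= c2
    return usual / n_sims, recentered / n_sims


a = 0.8 * (4 - 2)  # hypothetical shrinkage constant, not taken from the text
for theta in ([0.0] * 4, [2.0, 0.0, 0.0, 0.0]):
    print(theta, coverage(theta, a))
```

At theta = 0 the recentered set covers whenever the usual set does, because shrinking toward the origin can only reduce the distance to theta; at other values of theta the comparison has to be read off the simulated rates.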


Such a conditional inference strategy was also promoted in Casella (1988),
and some other procedures were also examined. In discussing this paper, a
number of alternate strategies were put forth. For example, Berger (1988)
advocates an “estimated confidence” approach, where the procedure $\langle C(x), \gamma(x) \rangle$
would be considered frequency valid if


$$E_\theta \gamma(X) \le P_\theta(\theta \in C(X)) \quad \text{for all } \theta, \qquad (13)$$


i.e., on the average, the confidence assertion is conservative. Lu and Berger
(1989a, b) have applied these ideas to Stein-type problems. Most recently,
Brown and Hwang (1989) have shown that for the confidence set $[x-c, x+c]$, where
$X = x$ is an observation from $f(x - \theta)$, the confidence procedure
$\langle [x-c, x+c], 1-\alpha \rangle$ is admissible, where $1-\alpha = \int_{-c}^{c} f(t)\, dt$. The
admissibility is with respect to the class of confidence procedures
$\langle [x-c, x+c], \gamma(x) \rangle$ (fixed $c$), where $\gamma(x)$ satisfies $E_\theta \gamma(X) \le 1-\alpha$
(frequentist validity) and the loss function is
$L(\theta, \gamma(x)) = (\gamma(x) - I(\theta \in [x-c, x+c]))^2$.

Another alternate strategy was described by Lindsay (1988), who 
suggested attaching both a frequentist and conditional confidence to a given set 
C(x). Although this is a sensible approach, it is probably the case that 
practitioners are more comfortable with one number for a confidence assertion. 
Thus, this reasonable solution might not find acceptability in practice. 

Returning to the questions posed in (11), we might now ask what is the 
reasonable requirement for the confidence assertion to be attached to $C(x)$.
Considering the theories of relevant sets, and how confidence sets are used by
practitioners, the following strategy seems most reasonable. For a set $C(x)$,
assert confidence $\gamma(x)$ where $\gamma(x)$ satisfies (10) for some (possibly improper) prior
$\pi(\theta)$. This strategy assures us that $\langle C(x), \gamma(x) \rangle$ is conditionally acceptable.
Moreover, we require that $\gamma(x)$ be valid as a measure of frequentist confidence.
Ideally, we would require that $\gamma(x)$ satisfy (12), which not only renders
$\langle C(x), \gamma(x) \rangle$ frequency valid, but also yields the conditional acceptability of
$\langle C(x), 1-\alpha \rangle$. However, condition (12) may not always be attainable and, in
such a case, we would settle for $\gamma(x)$ satisfying a condition such as (13). This
would give some frequentist acceptability to the procedure $\langle C(x), \gamma(x) \rangle$.

If neither condition (12) nor condition (13) can be attained by a $\gamma(x)$
satisfying (10), then frequentist acceptability may have to be compromised. The
frequentist guarantee of the procedure $\langle C(x), \gamma(x) \rangle$ may then be based on
quantities such as $E_\theta \gamma(X)$, $\min_\theta E_\theta \gamma(X)$, or $\min_x \gamma(x)$ (as long as these last two
quantities are positive). The point should be clear. A guaranteed legitimate
conditional inference is of primary importance. After that, the frequentist 
guarantee should be arrived at in some reasonable manner. 

These ideas have been investigated, in different forms, by Maatta and 
Casella (1987), Goutis, Casella and Maatta (1989), Goutis and Casella (1989) for 
estimating a normal variance, and Hwang and Casella (1988) for estimation of a 
normal mean. 
Discussion

The ideas behind conditional inference are deep, and here we have 
superficially sketched one line of work stemming from the developments of Fisher 
and Basu. There are many ideas in their work, both implicit and explicit, that 
have not been mentioned. (For example, Basu is an advocate of the Likelihood 
Principle; and recent work by Casella and Robert, 1988, suggests that violation of
this principle immediately leads to the existence of relevant sets.) However, the 
ideas of conditional inference play an important role in statistics. 

Although it might be argued that searching for relevant sets is an 
occupation only for the theoretical statistician, we must remember that practi- 
tioners are going to make conditional (post-data) inferences. Thus, we must be 
able to assure the user that any inference made, either pre-data or post-data, 
possesses some definite measure of validity. 
INTERVENTION EXPERIMENTS, RANDOMIZATION AND INFERENCE


Oscar Kempthorne, Department of Statistics, Iowa
State University, Ames, Iowa


Abstract 


This essay gives a discussion of processes of design and analysis of a 
study of the effect of two or more interventions or treatments on a set of 
experimental material (e.g., an agricultural area, or a set of mice, or a human). 
The problems of design, which includes, critically, the plan by which treatments 
are conjoined to experimental units, and of analysis are discussed. The author 
suggests that everything be based on randomization, both design and analysis by 
randomization tests and inversion thereof. The problem that usual conventional 
randomization gives bad plans is discussed and a suggestion is made to overcome it.
Parametric models are not used, so defects in conventional parametric inference 
do not arise. Discussion is given on subjectivity and objectivity. 


Introduction 


The term experiment is commonly interpreted to mean a variety of
activities. It can mean nothing more than observation of a piece of space-time; 
e.g., observing the moon by sending a moon shot. It can mean making a piece of 
material and measuring attributes of this piece. It can mean doing a study to 
attempt to determine the effects of a treatment protocol on a disease in humans. 
It is not entirely unusual to refer to a study estimating an attribute of a defined 
population such as the human population of the United States as an experiment, 
though most statisticians would say that such a study is a survey. Then we have 
the writings of theoretical statisticians that an experiment is a triple (X, A, P(θ))
where X is a sample space, A is an algebra of subsets of X and P(θ) is a set of
probability measures indexed by a parameter θ.

I have taken the position that there is a case for distinguishing three 
types of experiment with associated types of inference that I named sampling, 
observation and experimental (Kempthorne, 1979). 

In the sampling problem, there is a real existent population, say, the 
totality of human beings of the United States. Each individual has
unambiguously defined attributes, such as age, height, weight, amount of 
education and so on. The problem is very simple to state and to understand; 
namely, what is the frequency distribution of an attribute in this real population? 
It is easy to imagine having a huge army of enumerators — measurers, so that
every human is located, enumerated and measured. The inference problem in 
this case is also obvious: as a simple example, there is a population of ages, and 
this population has a mean. An inference problem is then to obtain data and 
then to make useful statements about the unknown mean. 

In the observation problem, we observe a whole population, but we hope 
and wish that this population that we observe is representative of a much larger 
population. Our explorers on the moon observed a portion of the surface of the 
moon over a very brief period (hours, I imagine), but the hope is that the 
observations are more or less typical of what would be observed over an extensive 
time. Similarly, we hope that our observations of planet Earth relate to its 
status over a significant period, e.g., years, decades, or centuries, etc. We are 
currently concerned about the ozone layer and wonder what its status will be in, 
say, 20 or 50 years. Obviously, to speculate about this, we must have 
observations at a few times and a validated dynamic model of how the status 
changes. So, then, in the observation problem, we must have a model that 
represents what we hypothesize about the unobserved world, unobserved because 
it is in the past or in the future, or at present and not looked at. 

In the present essay, I wish to address solely the third class of problem, 
which is easily exemplified. Let me give some examples. Atherosclerosis of the 
heart is a common enough problem: rather worrying, I am sure, and I know. 
How should this be treated? There are treatments by drugs, by diet, etc., and 
there is one treatment that is rather heavy — heart bypass surgery. There is then 
an obvious question. Is it a good idea to treat the sick person with bypass 
surgery? Other heavy questions arise with the disease of cancer in humans. 
What treatments are effective, which treatments are better than other 
treatments? The nature of situations of this sort is that we have a problem 
developing under its own dynamic, and the question is of what intervention will 
help. 


The Intervention Experiment 


A rather generally accepted, and, I imagine, not to be challenged, partial 
model is that we have materiel and a set of interventions. The partial design of 
the experiment is to partition the experimental materiel into pieces and then 
place one of the interventions on each piece of materiel. The branch of statistics 
called the design of experiments was started by R. A. Fisher at the Rothamsted 
Agricultural Experiment Station. The materiel was agricultural land, planted 
with certain crops such as wheat, or mangolds, or grass, etc., which was 
partitioned into pieces called plots, and the treatments were various agricultural 
interventions such as nutritional supplements. An example that seems 
superficially quite different is a psychological experiment in which the materiel is 
part of the life of a human subject, for example, the 6 days of a week, and the 
pieces are human-days. The treatments could be various drug regimes. The aim 
of the experiment might be to palliate depression, for instance. 

The performance of the experiment consists of the following steps: 


(i) defining the problem, which will consist of specifying the experimental 
material and specifying the interventions (treatments) that are to be 
compared; 


(ii) dividing the experimental material into plots, each of which is to 
receive a treatment; 


(iii) deciding how to conjoin the set of plots and the set of treatments, 
taking into account the totally obvious fact that a plot can receive 
only one of the treatments; 


(iv) letting the experiment proceed to the prechosen termination point; 
e.g., the point of harvest of an agricultural crop, or recovery or judged 
failure of a medical treatment; 


(v) taking measurements that are thought to be relevant to the problem; 


(vi) “analyzing” the resultant data: I put the word analyzing in quotation 
marks because this is by no means a well-defined operation; and the 
drawing of conclusions, with the same obscurity; 


(vii) discussing usefully how the conclusions can be extended to what is 
often called the target population. 


The “Design” of the Experiment 


It is commonplace among statisticians who actually work with real 
investigators (not individuals who only write about the design of experiments) to 
consider all three of steps (i), (ii) and (iii) as critical components of the design of 
the comparative intervention experiment. Both adjectives comparative and 
intervention are essential. 

It is useful, I think, to mention for comparison, the type of study in 
which the outcome is thought or modelled to be a realization of a random 
variable, X say, which is distributed according to a distribution determined by 
some control variables, say z, and indexed by some parameter $\theta$, where z and $\theta$ 
may be vectors. Such a study is purely mathematical. 

It is rather obvious, at least by hindsight, that a natural field for 
thinking about the comparative intervention experiment is farm agriculture or 
garden agriculture. Suburbia consists mostly of houses on individual lots with 
associated grassed areas — commonly called lawns. Almost all suburbanites 
experience problems with their lawns. The grass is thin, is dying or has died. 
What should be done to obtain a lawn that is good looking? What interventions 
should be made? In trying to teach the design of experiments I have often used 
this problem as an example. It is not at all surprising that the formulation of a 
set of procedures for the experiment was done at the Rothamsted Agricultural 
Experiment Station. The beginning of experimental agriculture was made by 
Lawes and Gilbert in, say, 1843. The most famous Rothamsted experiment is, 
surely, the Broadbalk field experiment on wheat which was started in 1852 and 
has continued to the present time. The field, Broadbalk, was divided into 13 plots 
for different nutritional treatments. The yields of wheat were analyzed in a 
certain way by Fisher (1921). Later Fisher (1924) gave a data analysis of the 
yields (for years 1852 to 1918) attempting to determine the influence of rainfall on 
yield. 

The use of intervention studies obviously goes back for centuries or 
millennia — humans found that eating certain plants was harmful or even fatal. 
It was only in this century that a partial logic was developed. 

That the design and analysis of intervention experiments did not 
originate in connection with human nutrition or human medical problems is not 
surprising, perhaps, because the comparative intervention experiment requires 
conjoining one of several treatments to each experimental unit, e.g., human. 
There were obviously no ethical problems in treating a plot of land with one of 
several treatments. 

There was the recognition that there was variability between 
experimental units that received the same treatment, and it was obvious that this 
variability was not the result of measurement error. The existence of such 
variability was exhibited completely by the various uniformity trials that were 
conducted, after agricultural scientists recognized that there were problems of 
design and of analysis. 


The Field Plot Experiment 


Suppose that our initial problem is that of Lawes and Gilbert in 1843. 
We wish to determine the effectiveness of several nutritional treatments for 
wheat. We realize that the yield of wheat grown under the same regime varies 
over England. Obviously, the yields at Rothamsted will not be the same as the 
yields in Cornwall or even on a farm 5 miles from Rothamsted. We are able to 
perform the experiment at Rothamsted and have the field Broadbalk to use. 
Then, obviously, we can hope only to determine somewhat the effectiveness of the 
treatments on Broadbalk field of Rothamsted. We realize that we can only, at 
best, determine the differences among treatments as measured on Broadbalk field 
in year, say, 1852. Suppose that we can determine these differences exactly. 
Then to apply the results to what will happen elsewhere and in different years 
(e.g., 1990), the only process we can use is to assume that the treatment 
differences will be the same or that the differences are related to some variables 
that are known for the other circumstances. 

This thinking leads me to a view of the fundamental problem of what we 
might (but should not necessarily) call experimental inference. I state this in 
very simple form: 


We have a collection, a set, of experimental material. We have a 
set of interventions or treatments. Our task is to form judgments 
on the effects of the treatments on this collection of material. 


The extension of conclusions to some larger set of material is a problem I 
shall not address. I merely make the comment that making the assumption that 
the material used in the experiment that is performed is a random sample from 
some large population of material is unjustifiable, though perhaps the only way 
to make even a guess. 

I shall discuss agronomic field experiments later, but I first wish to 
consider what I call experimentation on a line. 


Experimentation “On A Line” 


Suppose we have an oil processing plant with an inflowing pipeline of 
feed stock. We wish to examine the differential effects of some treatment 
processes; e.g., the use of different catalysts. Then our procedure will be to take 
time slugs of the input and treat each slug with one or other of the treatments. 
We shall use time slugs that are separated by intervals necessary to make the 
alterations in the processing and to allow the processing to reach equilibrium 
status under each given treatment. 

As a result of such considerations we shall have experiment time slugs 
that can be indexed by 1, 2,..., the integers. Suppose now that we have 4 
treatments, say A, B, C and D, and we have decided to use 20 successive time 
slugs. Then the question must be faced of how we are to assign A, B, C, D to 
the slugs. An obvious suggestion is to use the sequence ABCDABCD... but only 
a fool would do this. Why do I say this? There will be undoubtedly a time 
trend in the nature of the feed stock and one would expect there to be variation 
around the time trend. I put the words time trend in bold because I find it 
difficult to find another term. One would expect that if one made a uniformity 
trial, thereby using only one treatment — say A, that the difference squared 
between observations on different time slugs would depend on the distance 
between the time slugs. In the particular example I am using, the uniformity 
trial will have been given by preexperiment records. 

There would be no computational difficulty with any treatment 
assignment in using a linear model, 


$$y_i = \tau_{j(i)} + e_i$$


where $\tau_{j(i)}$ is the effect of the treatment in slug $i$ and $e_i$ is the error, and then to 
assume that the set $\{e_i\}$ is a realization of 20 independent random Gaussian 
variables that have mean 0 and variance $\sigma^2$ (unknown). From even an 
elementary first course in statistics one can set this up as a Gauss Markov 
Normal Linear (GMNL) model, do the ANOVA, make the usual tests of 
significance, set up the usual confidence intervals, etc. 

The experimental scientist with even minuscule understanding of 
variability should object to the plan (the treatment assignment above and the 
ensuing analysis as given by the usual elementary procedures in the attempted 
statement of precision of estimation of the differences between treatments) for 
the simple reason that treatments A and B are contiguous, treatments A and C 
occur at points that are apart by 2 units, and A and D are contiguous half the 
time and apart by 3 units the other half. So one would expect the difference 
between treatments A and B to have lower variance than that between A and C. 
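
The objection can be illustrated with a small numerical sketch. It assumes, purely for illustration, a noise-free linear time trend with a hypothetical slope of 1 per slug and no treatment effects whatever; none of these numbers come from the essay. Under the systematic plan the trend alone then shows up as apparent treatment differences.

```python
# Illustrative sketch only: a pure linear time trend, no treatment effects.
slope = 1.0                       # hypothetical trend per time slug (assumption)
labels = "ABCD" * 5               # systematic plan ABCD...ABCD for 20 slugs
y = [slope * i for i in range(1, 21)]   # response of slug i is trend only

def mean_for(treatment):
    """Mean response over the slugs that received the given treatment."""
    vals = [yi for yi, lab in zip(y, labels) if lab == treatment]
    return sum(vals) / len(vals)

# Apparent "effects" relative to A, produced entirely by the trend.
apparent = {t: mean_for(t) - mean_for("A") for t in "BCD"}
print(apparent)
```

In this invented case the apparent differences B minus A, C minus A and D minus A equal one, two and three times the slope, although no treatment does anything. A real feed stock would add variation around the trend, which this sketch deliberately leaves out.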

What then should be done? It is a standard cliche of the design of 
experiments that one has to contemplate analysis to evaluate designs. It is less 
standard (and even not accepted by some) that the proper analysis (if there is 
one, and this is by no means sure) is determined to a considerable extent by the 
design. 

Suppose that one has used the treatment assignment stated above; i.e., 
ABCDABCD...ABCD. At the end of the experiment, one has observations $y_1$, 
$y_2$, ..., $y_{20}$. How should one “analyze” the data? I imagine that 10 statisticians 
would produce perhaps 5 different analyses. There is the obvious one mentioned 
above. A second one would be to note that the whole sequence is made up of 5 
blocks each containing the 4 treatments A, B, C and D. Then to compound the 
naivete, the statistician could say that he is doing a randomized block analysis, 
though this can reasonably be characterized only as a block analysis. But why do 
this? Such an analysis ignores almost completely that the units are on a line. 

Why not consider the model 


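$$y_i = \beta_0 + \beta_1 i + \tau_{j(i)} + e_i$$


or 

$$y_i = f_i + \tau_{j(i)} + e_i$$


where $f_i$ is some function of $i$ (e.g., a quadratic or higher degree polynomial), 
$\tau_{j(i)}$ is again the effect of the treatment in slug $i$, and 
$e_i$ is a term that is called error? 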
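
As a minimal sketch of what fitting such a model involves, the following uses ordinary least squares on invented, noise-free data with a linear trend. The intercept, slope and treatment effects are hypothetical, and taking treatment A as the baseline (effect 0) is an identifiability convention assumed here, not something stated in the essay.

```python
import numpy as np

# Hypothetical noise-free data for 20 slugs under the plan ABCD...ABCD.
labels = list("ABCD" * 5)
i = np.arange(1, 21)
true_tau = {"A": 0.0, "B": 1.0, "C": -1.0, "D": 2.0}   # invented effects, A = baseline
y = 2.0 + 0.5 * i + np.array([true_tau[t] for t in labels])

# Design matrix: intercept, linear trend, indicator columns for B, C and D.
X = np.column_stack(
    [np.ones(20), i] + [[1.0 if t == k else 0.0 for t in labels] for k in "BCD"]
)

# Least-squares estimates of (beta_0, beta_1, tau_B, tau_C, tau_D).
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
beta0, beta1, tau_B, tau_C, tau_D = coef
print(coef)
```

The computation succeeds for any trend function one cares to put in the design matrix. Whether a linear (or any other) choice of trend is appropriate is a separate matter that the fit itself cannot settle.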

The range of possible models with regard to the systematic part — the 
non-error part of the model is huge. In our little case, it is just the number of 
functions definable on the set of 20 values of i. It is perhaps of interest to 
mention that I remember with vividness being given a set of data of an 
experiment like this and the task of analyzing the data when I had completed a 
bachelor degree in mathematics at Cambridge. I was scared stiff — petrified, 
then. After many decades of being comfortable with the standard programs of 
statistical methods, I find I am again scared, except when randomization is used. 

An aspect of standard statistical methods that should cause questioning, 
but seems not to, is the nature of error. What is this error that statisticians talk 
and write about? One part of error is error of measurement, and this is very easy 
to understand. We have a process of measurement and often, or always in our 
imagination, we can measure without affecting the object or entity being 
observed. We assume without questioning, it seems, that individual unknown 
errors of measurements are independent realizations of a scalar random variable. 
With this mode of thinking, it is natural to think of a large number of 
measurements of the entity being measured, and that the error of a particular 
measurement is the deviation of the result from the average. Curiously then, this 
error is conceptualized by means of what would be observed with repetition, with 
what might have happened, a notion objectionable, it seems, to Bayesians. 

In a real experiment with the usual nature of experimental units, there 
are, in fact, differences between the units, and there will be differences between 
units in the absence of measurement error, with the same treatment, as we would 
observe in a uniformity trial. These are called plot errors or experimental unit 
errors. Is it proper to use the term error for such variability? 

Suppose for definiteness that I wish to quantify the result of applying a 
treatment to 2 units: I do the experiment and I obtain 2 numbers $y_1$ and $y_2$. Is 
the difference between $y_1$ and $y_2$ an indication of error in this little study? We 
learned in our elementary statistics the role and importance of replication. I 
suggest, however, that we, including our founding fathers, have not thought out 
and told us what replication is. It seems easy and unquestionable that 
replication consists of repetition under constant circumstances. But we never 
have constant circumstances. Perhaps nearly so in a chemical or physical 
laboratory but not in, say, interventional research on humans. Fisher (1937, 
Sections 25 and 26) gives an interesting and relevant but not totally convincing 
discussion. In the case of the agronomic field experiment, he says that the 
problem of the impossibility of testing two or more treatments in the same year 
and on identically the same land can be overcome by testing the treatments on 
random samples of the same experimental area. Perhaps this will make my 
doubts seem reasonable. In the case of a field experimental area that is divided 
into parts, 2 plots are the same only if we agree to say this, and if we look at 
them sufficiently carefully, they will be found to be different. So it seems that we 
never have what may be called real replication in any sort of intervention 
experiment. This seems almost an absurd line of thought. We can have 
replication only in the sense of repeating a set of operations (e.g., of baking a 
cake). 

A natural model to characterize the variability of the observations is to 
assume that the errors, $e_i$, $i = 1(1)20$, are a realization of a short section of a 
time series; e.g., a moving average process or an autoregressive process. However, 
one can surmise that the choice of a parametric class of models and subsequent 
fitting will be difficult. Finally, the assessment of uncertainty in treatment 
effects will be difficult. It is curious that methods based on such ideas have not 
been well developed and used. 


The Fundamental Problem of the Intervention Experiment 


I take the basic common structure of the intervention experiment to be 
that we have units of material that we index by $i$, and we have interventions that 
we index by $j$. If we conjoin unit $i$ and intervention $j$, we obtain an observation 
$y_{ij}$. The fundamental problem is that we cannot determine how the observation 
$y_{ij}$ is caused. We cannot conjoin more than one treatment with unit $i$. If, for 
instance, we could observe $y_{11}$ and $y_{12}$, we could conclude that the effect of 
treatment 2 minus the effect of treatment 1 on unit 1 is $y_{12} - y_{11}$. We shall 
observe, say, $y_{11}$ and $y_{22}$. Then the difference $y_{22} - y_{11}$ can be attributed equally 
well (and equally badly) to this difference being the effect of treatment 2 minus 
the effect of treatment 1 or the effect of unit 2 minus that of unit 1. It is obvious 
that we have to deal with a set of units, some of which receive treatment 1 and 
some treatment 2. Suppose then we observe in a small experiment 


$$y_{11} = 10, \quad y_{22} = 15, \quad y_{31} = 13, \quad y_{42} = 20.$$


We are inclined to view that the effect of treatment 2 minus the effect of 
treatment 1 is 


$$\tfrac{1}{2}(15 + 20) - \tfrac{1}{2}(10 + 13) = 6.$$


But we can equally well conclude that this difference should be attributed 
to 
(unit 2 + unit 4) minus (unit 1 + unit 3) 


In fact, the size of the experiment is irrelevant to the difficulty. If we have 
treatment 1 on 1,000 units and treatment 2 on a different set of 1,000 units, 
whatever mean difference we observed can be equally well attributed to difference 
of effects of treatment 1 and treatment 2 or the difference between the 2 sets of 
units. 

This leads to the absurd conclusion that we cannot determine whether 
any intervention produces some effect. Obviously the conclusion is false. What 
has often enabled the conclusion that an intervention is, e.g., successful, is a sort 
of empirical Bayesian reasoning. If, for instance, in the past all humans who 
have contracted a disease subsequently died, and one individual who contracted 
the disease and received an intervention survived, then one concludes that the 
intervention was successful. It may be, of course, that there is something unique 
about the individual and, thus, the intervention has not produced the successful 
outcome. One guesses that most so-called quack remedies have come about by 
this route. 

This procedure is, of course, the method of historical controls, which has 
been very successful in many contexts. The method has been successful when the 
result produced after intervention is hugely different from the historical record. 

The first act that must be considered in thinking about an intervention is 
to ask what the historical record is without intervention and with intervention. 
Such questioning is usual, of course, in the case of treatment for illness, especially 
when the intervention is not reversible or removable; e.g., in a partial 
gastrectomy. In many situations with a new intervention, there is no historical record 
of the outcome from it. In many cases, the outcome without intervention and 
with intervention is very variable. Insofar as there is a historical record, it is 
imprecise and exhibits variability. It would then be very difficult to determine a 
historical control. 

Even though the idea of a historical control is very appealing, there is a 
very difficult problem of deciding whether a proposed historical control is 
appropriate. What indeed makes a historical record relevant to evaluation of 
proposed intervention? In raising this question, I am thinking about 
interventional studies in connection with human illness and disease. We are told 
frequently that an attempt to determine if an intervention helps must incorporate 
its own controls. An exemplar case in which controls must be included in the 
experiment is that of agricultural research; for example, evaluation of a 
nutritional treatment on farm animals or farm crops. 

Holland (1986) has written very informatively on the general problem I 
am discussing. 


Design and Analysis 


These are surely interrelated. The quality of a design can be determined 
only by means of the method of analysis and the quality of the conclusions. So 
the first step in considering design must revolve around the method of analysis. 

The first step in standard theory of data analysis is to assume that the 
data D are a realization of a random variable X that has a distribution function 
$F_\theta$, which depends on a parameter $\theta$. The next step is to determine if the data 
are in agreement with a particular value $\theta_0$. 

This step in Neyman-Pearson-Wald theory is to construct a rule for 
rejecting the hypothesis that $\theta = \theta_0$. This rule is to have the property that the 
probability under the model that it rejects $\theta = \theta_0$ when $\theta$ is in fact $\theta_0$ is some 
prechosen $\alpha$. Then, with this done for every $\theta_0$, the values of $\theta$ that are not rejected 
by this rule are said to constitute a $(1 - \alpha)$ confidence set for the unknown $\theta$. 

Related to this process, but different from it, is the use of significance 
levels, often called P values. Inversion of the whole family of related significance 
tests of $\theta = \theta_0$ for a set of values of $\theta_0$ gives a region of values of $\theta$ that agree 
with the data to a designated extent. 

My preference is to regard the regions so obtained as consonance regions, 
regions that specify values of $\theta$ that are consonant with the data at chosen levels. 

These procedures, however characterized by particular words, do not give 
probabilities of hypotheses such as the probability that $\theta$ belongs to any chosen 
region of the parameter space. 

If, then, the aim of the whole exercise, design, performance and analysis 
of the experiment is the obtaining of such probabilities, the procedures are totally 
unsuccessful. 

The group of statisticians known as Bayesians take the position that the 
aim of all investigation must be the obtaining of such probabilities. Then it is 
obvious that one can reach the result with the introduction of a prior 
distribution. Unfortunately there is no logic that forces choice of a prior. It is 
the conclusion of this line of development that the probability outcome is a belief 
probability that depends critically, obviously, on the prior belief probability. 

My opinion is that the processes of science and technology do not require 
belief probabilities. The processes of science and technology require the obtaining 
of data under circumstances chosen by the investigator, and analysis of the data, 
which consists of making judgment of whether the data are consonant with 
particular models suggested by previous investigations or of determining new 
models from the data that are obtained. The idea that one has a realization 
from the holy trinity (to use a phrase of Basu) is simply ludicrous, so ludicrous 
that I can only suggest that those who base on it their ideas of learning about the real 
world, its present position and its dynamics, have no experience of the nature of 
the processes one must use. One never knows the model! Did Newton know of 
the inverse square gravitation law? I say, “Obviously not”. He and other 
scientists knew that motion of the planets was elliptic — they knew this by 
observation and data analysis. The Bayesians write as though the past workers 
knew that the law of force was $d^{\gamma}$, where $d$ is the distance and $\gamma$ is a parameter, 
and that they also had a belief distribution or a prior distribution on $\gamma$. 

Another example that comes to my mind, though I have no depth of 
understanding, is the nature of the universe. It is expanding it seems, but will it 
continue to do so, or will it stop expanding or stay as it is, or start contracting 
and reach the size of a golf ball, or something even smaller? The idea that 
analysis of astronomical data should use a parametric model determined by some 
$\theta$ with a prior belief distribution on $\theta$ seems to me to be an antithesis of scientific 
method. 

I therefore take the view that the Bayesian prescription, which is being 
heavily touted as the prescription by which all the uncertainty about this world 
in which we have to live can be handled, is not worth considering. The 
prescription is very beautiful in its simplicity and its power. There are many nice 
theorems in its theory. But it is based on assumptions and ideas that cannot be 
validated. It is true, of course, that any reasonable prior will be overcome by 
data eventually if the data come from an unvarying stochastic process. This, 
however, is essentially useless in that (a) any individual has a finite life and (b) 
the models that are consonant with past data change with new data. A critical 
process of science is the determination of a model that is consonant with all data 
accumulated in the past and then challenging that model, which is done only by 
new experiments and determining if predictions from the old model are realized in 
the new experiment. The lesson of science of the past century is surely that the 
models of yesteryear, while having predictive value for circumstances under which 
they were developed, are found to fail. It follows then that evaluation of 
goodness of fit (e.g., of the question of whether a prediction and the actual 
realization agree) is an essential element of science. It is, of course, an essential 
element of decision making. Where does the particular $(X, A, P, \theta)$ come from? 
The very neat presentations start off with the assumption that this is known. 
How silly this is! I think I have said enough. 


Randomization “Inference” 


I have tried to communicate my opinion that the usual frequentist theory 
and Bayesian theory, which purport to address the problems of inference and 
decision making, are failures. The failure of frequentist theory is not as deep, 
because it does recognize, though not at all adequately, that a stochastic model 
for a particular situation is a pure invention, which must be discovered, checked 
out and validated by means of real world data. 

It is useful, perhaps, to discuss the matter of subjectivity and objectivity, 
which seems to require discussion forever (see, for example, Berger and Berry, 
1988). The background seems to be the perception that Neyman-Pearson-Wald 
theory claims to be objective in contradistinction to Bayesian theory, which is 
subjective. The described polarity is partly fake. The real story is that both 
theories qua theories are theories, and neither is subjective or objective, just as a 
theory is not heavy or light in the sense of weight avoirdupois. 

The only question is whether the practical use of either of the two rival 
theories is subjective or objective. My answer to this is that the NPW theory is 
partly objective in that the statistical models it uses must be confronted by the 
associated data, even though theory books say nothing about this. Practitioners 
of Bayesian theory (if there really are any) seem to pull their models and their 
prior distributions out of thin air but obviously do not. They do, however, make 
beliefs an absolutely essential component of their procedures, and any reasonable 
use of language must characterize the introduction of beliefs as subjective. 

In the Bayesian framework, the conditional distribution of the supposed 
random variable given the parameter value is checkable. If for instance $X \mid \theta$ is 
$N(\mu, \theta)$, we can check this by looking at a normal plot. If, however, we wish to 
adjoin to this the assumption that $\theta$ is $N(\nu, \phi^2)$, how are we to check the 
appropriateness of this assumption? Someone else could declare that he would 
like to assume that $\theta$ is distributed Cauchy (or whatever). Even more simply, 
where do $\nu$ and $\phi^2$ come from? The fact that the values seem not to matter 
(but, of course, they do!) gives me no comfort, and I think I am not alone. 

The obvious conclusions that should be drawn from the objective- 
subjective polarity that seems to be necessary are twofold: 


(a) use of NPW theory requires data confrontation, which is not discussed in 
any theory book but uses portions of general distribution theory and, 
obviously, significance tests to make such confrontation; 


and 


(b) use of Bayesian theory requires data confrontation, but this is not 
discussed in any exposition of the theory — I include exposition by any of 
the purported founding fathers. I shall not give references. Let any 
reader of this essay pull his (her) favorite exposition and examine it with 
respect to what I am discussing; in fact, I think, Bayesian users use the 
distribution theory and significance tests that NPW users use; finally, 
Bayesian theory is subjective in that a prior is plucked out of thin air or 
quasi-derived by theory which itself is not validated for use even though 
based on axioms that seem (but are not) unchallengeable. 


The story really is that NPW theory is the half-clothed emperor while 
Bayesian theory is the emperor without any clothes. 

My discussion does not include empirical Bayes procedures which depend 
on data analysis in the choice of constituent distributions and face the same 
difficulties as NPW theory. 


Where Do I Come Out? 


I have given my views about the general mix of NPW decision theory 
and Bayesian theory. It seems to me that there are huge lacunae or gaps 
between the currently available theory and needed applications. 

I now turn to the intervention experiment problem. My perception of 
the history is that our founding father, Fisher, recognized almost all the problems 
that I have mentioned, but was not as explicit as he could have been. 

I am of the opinion that the assumption in a comparative intervention 
experiment that the outcome is a random variable from a probability distribution 
of a family of distributions indexed by some parameter of interest is not 
supportable. 

So the question then is: Can anything be done? An answer is that 
something can be done; namely, use randomization in the conjoining of units and 
treatments and then use tests of significance (= tests of consonance) that are 
based on the frame of reference induced by the randomization process used. 

Obviously, I am of the opinion that tests of significance are useful. If 
one regards them as useless, one is, it seems, in the position of being unable to 
determine objectively that a data set is not consonant with a particular model. 

The value of randomization and the randomization test of significance in 
the randomized intervention experiment is that the probabilities that arise in the 
justification are not belief probabilities but are frequency-in-repetition 
probabilities determined by the randomization process used. 
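
A minimal sketch of such a randomization test follows, using the four observations 10, 15, 13 and 20 of the small experiment discussed earlier. Several things are assumptions made only for this illustration: that exactly 2 of the 4 units were assigned to treatment 2 with all 6 such assignments equally likely, that the test statistic is the difference of treatment means, and that under the hypothesis tested each unit would have given the same result with either treatment.

```python
from itertools import combinations

# Observations by unit; units 2 and 4 actually received treatment 2.
y = {1: 10, 2: 15, 3: 13, 4: 20}
observed_t2 = (2, 4)

def mean_diff(t2_units):
    """Mean of the treatment-2 units minus mean of the remaining units."""
    t2 = [y[u] for u in t2_units]
    t1 = [y[u] for u in y if u not in t2_units]
    return sum(t2) / len(t2) - sum(t1) / len(t1)

# Reference set: the statistic for every assignment the randomization could
# have produced, with the observations held fixed (no treatment effect).
reference = [mean_diff(c) for c in combinations(y, 2)]
obs = mean_diff(observed_t2)

# Significance levels are proportions of the reference set.
p_one_sided = sum(d >= obs for d in reference) / len(reference)
p_two_sided = sum(abs(d) >= abs(obs) for d in reference) / len(reference)
print(obs, sorted(reference), p_one_sided, p_two_sided)
```

The probabilities here are simple proportions over the 6 assignments, so they derive from the assumed randomization process alone. With so few units the achievable significance levels are very coarse (the one-sided level cannot fall below 1/6).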

I think that most of the criticism of use of P values comes from a literal
interpretation of Neyman-Pearson theory with its accept-reject rules and its type
I error. Such a test of $\theta = 0$, say, carries with it the idea that $\theta$ may really be 0.
In the significance testing outlook the achieved significance level is a measure of
strength of evidence against the hypothesis $\theta = 0$. Use of the significance test of
$\theta = 0$ carries no implication that $\theta$ may be exactly zero. Also, no one should
have a strongly different outlook if P were 0.049 rather than 0.051, as
Neyman-Pearson theory suggests.

The determination of confidence intervals or regions or, as I prefer,
consonance intervals or regions for a parameter $\theta$ requires a formulation of how
results arising from $\theta_1$ will differ from results from $\theta_2$. In the case of
intervention experiments, the idea is used that if a unit with intervention $j$ gives
a result of $y$ then with that same unit intervention $j'$ would give the result
$y + (\tau_{j'} - \tau_j)$.


What Randomization Process to Use? 


This is, I judge, the basic question to be addressed. I think it was not 
addressed properly in past years. In experimentation on a line with say 8 units 
and 2 treatments denoted by A and B, the plan 


AAAABBBB 


is obviously a bad one. 

What makes a plan bad? It is obvious that, with n units and t
treatments, there are $t^n$ possible treatment assignments, so in the case of 8 units
and 2 treatments, there are $2^8 = 256$ possible assignments, 2 of which are completely
useless. The first plan is bad because the units are on a line. The 4 units that 
receive B occur later than the 4 units that receive A. In the second plan, B 
occurs after A in step. The third one that I give is a sandwich plan, which was 
discussed by Yates (1939). A plan is bad if the treatment assignment favors or 
seems to favor the treatments unequally. If one knows nothing about the units 
and they are labelled 1, 2 to 8, the first plan is not “bad”. What one wants is 
that the plan be balanced with respect to the variability among the units that 
one thinks may be present. Choice of randomization process is then a matter of 
informal Bayesian thinking. A plan is bad if the investigator thinks so.

With 8 units on a line, one may have the opinion that the position of the 
units on the line tells one nothing about the variability among the units. One 
may think that most of the variability is expressed by a difference between the 
first 4 units and the second 4 units. One would then use this partition as a block 
partition. But, obviously, doing this is only part of the problem. One would still 
have to decide how to place A and B within each block. The sandwich plan 
seems not unreasonable. Another plan would be to partition the 8 units 
segmentally in blocks of 2. Then one would have to decide how to place the 
treatments within the resultant blocks. It is reasonable to surmise that 


| AB| AB| AB| AB| 


is a bad plan. 

Suppose we wish to compare 2 treatments on a piece of land. We could 
partition the land into 2 pieces, one of which would receive A and the other B. 
This would be an appallingly bad choice. Why? What informative model can 
one use? How could one obtain an idea of error of conclusions? We could divide 
into 4 pieces of land, into 8 pieces, into 16 pieces, and then decide on a 
partitioning of the pieces into blocks. We could partition the land into a 2x2 
array and assign the treatments according to a 2x2 Latin square. We could 
partition the land into a 4x4 array and then use a plan in which A and B each
occur twice in each row and in each column. There are undoubtedly many other 
possibilities. 
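
To give a feeling for how many candidate plans even one of these possibilities generates, the following minimal sketch enumerates the 4x4 plans in which A and B each occur twice in every row and every column. The function name and the string representation of a plan are illustrative assumptions, not anything taken from the text.

```python
from itertools import combinations, product


def balanced_4x4_plans():
    """Enumerate 4x4 plans with A and B each twice in every row and column."""
    # Each row places A in exactly 2 of the 4 columns; B fills the rest.
    row_patterns = [
        tuple("A" if col in cols else "B" for col in range(4))
        for cols in combinations(range(4), 2)
    ]
    plans = []
    for rows in product(row_patterns, repeat=4):
        # Keep the plan only if every column also holds exactly two A's.
        if all(sum(row[col] == "A" for row in rows) == 2 for col in range(4)):
            plans.append(rows)
    return plans


plans = balanced_4x4_plans()
print(len(plans))
for row in plans[0]:
    print("".join(row))
```

The enumeration finds 90 such plans, so even this single restriction leaves a sizeable set from which a plan must still be chosen.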

How should one choose among all the possibilities? Why does the 
problem of choice arise? It arises because we have to decide how to partition the 
experimental material into pieces such that all subpieces of a piece receive the 
same treatment and then, of course, assign the treatments to the pieces. In the 
case of the agronomical field plot trial, the pieces are called plots, and the choice 
of plots is a matter that is discussed under the rubric Field plot technique. I shall 
not discuss this. 

It is obvious intuitively that the pieces, the plots or the experimental 
units should be partitioned into subsets that are as alike as possible, with a 
subset for each treatment. But one can only guess about the alikeness of the 
units. One’s guesses about alikeness may prove to be very poor. The actual 
experiment must be such that one can form a judgment about the alikeness of 
the units and then apply that judgment to form objective judgment about the 
alikeness of units receiving different treatments. 

I am saying nothing new in these remarks. The ideas are all in Fisher’s 
The Design of Experiments. Fisher discussed only two designs, the randomized 
block design and the Latin square design in that book. Various other designs are 
discussed by Cochran and Cox (1957). Later, Yates initiated the ideas of 
incomplete block designs and designs for two-way elimination of heterogeneity. 

The ideas used for making analysis of the resultant data were those of 
linear models and analysis of variance. Fisher proved (insofar as Fisher proved 
anything!) that, if one used the customary randomization of the randomized 
block design and of the randomized Latin square design, then treatment 
comparisons were unbiased (meaning that the comparisons estimated by the use 
of the ordinary linear models and the method of least squares were unbiased for 
what one would observe if one could assign every treatment to every unit). Also 
the variance over randomizations of estimated treatment comparisons could be 
estimated by analysis of variance, if unit-treatment additivity holds, though 
Fisher was not aware of this requirement. Later, Yates gave the idea that the 
design should be unbiased in the sense that the expectation of the treatment 
mean square should equal the expectation of the residual (error) mean square in 
the absence of treatment effects. 

The properties indicated in the previous paragraph hold in the case of the
randomized block design only if each block comprises a completely randomized design.
The requirement for the Latin square design is unclear, except that the properties 
are realized if one chooses a Latin square plan from the totality of Latin squares 
of the given size. 

It was realized, for example, by Grundy and Healy (1950), that the use 
of such randomization gave some realizations that were bad. For example, on a 
piece of land with 3 blocks of 4 plots, the blocks being aligned, one might obtain 
the plan: 
BCAD
BCAD 
BCAD 


This is obviously a bad plan. An 8x8 Latin square design involving several 
factors each at 2 levels could result in the levels of one of the 2 level factors 
occurring in the 4 quarters of the square. Grundy and Healy made a suggestion 
of a restricted randomization plan. Youden (1956) discussed the problem, as did 
Sutter, Zyskind and Kempthorne (1963). There has been extensive work in 
recent years by Bailey and others. 

The whole line of development with regard to restricted randomization 
appears to have been dominated by analysis of variance unbiasedness. 

It is worthwhile to note that the randomized block design is a restriction 
of the completely randomized design and that the Latin square design is a 
restriction of a particular randomized block design, so the idea of restricted 
randomization goes back to the beginnings of the subject of design. 

In recent years, I (Kempthorne, 1986a, b) have reached the opinion that 
the whole matter of randomization, and associated estimation and tests of 
significance, needs to be rethought in what is, conceptually, a very simple way. 
We realize, or should do so, that use of the classical designs is based on a sort of 
informal Bayesian process, in which one guesses or judges, or suspects or surmises 
(but does not believe) that the pattern of variability among the experimental 
units is such and such; for example, units within blocks are very much alike, 
while the units in different blocks differ appreciably. 

The suggested procedure is that the experimenter specifies a set of plans, 
which he surmises will give fair comparisons among the treatments. He (she) 
then uses this set as a randomization frame for choice of plan that is used and for 
the randomization test of the null hypothesis of no treatment differences and for 
the randomization test of any shift alternative by adjusting the data to the null 
hypothesis. 
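
The suggested procedure can be sketched in a few lines for two treatments. This is only an illustration under stated assumptions: the frame used here (all plans with 4 A's and 4 B's on 8 units), the responses, the test statistic (absolute difference of treatment means) and the function names are all hypothetical choices, and a shift alternative is handled, as described above, by adjusting the data to the null hypothesis.

```python
from itertools import combinations


def mean_difference(plan, y):
    """Mean of the B units minus mean of the A units for one plan (a string of A/B)."""
    a = [v for t, v in zip(plan, y) if t == "A"]
    b = [v for t, v in zip(plan, y) if t == "B"]
    return sum(b) / len(b) - sum(a) / len(a)


def randomization_p_value(plans, used_plan, y, shift=0.0):
    """Significance level for the hypothesis tau_B - tau_A = shift.

    The data are first adjusted to the null hypothesis by removing the
    hypothesized shift from the units that received B under the plan used.
    The reference set is the list of plans supplied by the experimenter.
    """
    adjusted = [v - shift if t == "B" else v for t, v in zip(used_plan, y)]
    observed = abs(mean_difference(used_plan, adjusted))
    as_extreme = sum(
        abs(mean_difference(p, adjusted)) >= observed - 1e-12 for p in plans
    )
    return as_extreme / len(plans)


# Hypothetical frame: every plan with 4 A's and 4 B's on 8 units.
frame = [
    "".join("A" if i in a_units else "B" for i in range(8))
    for a_units in combinations(range(8), 4)
]
used_plan = "ABBAABBA"                                # plan drawn from the frame
y = [1.0, 12.0, 13.0, 4.0, 1.0, 12.0, 13.0, 4.0]       # made-up responses

p_null = randomization_p_value(frame, used_plan, y)
p_shift = randomization_p_value(frame, used_plan, y, shift=10.0)
print(p_null, p_shift)
```

A consonance interval for the treatment difference could then be formed, in the same spirit, as the set of shifts whose significance level exceeds a chosen value. With a smaller frame of acceptable plans the same function applies unchanged, but the attainable significance levels become correspondingly coarser.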

In the case of experimentation on a line, the only attribute of a unit that
is known is $x$, equal to its position. One can then pick out of the totality of
plans, those for which $\sum x$ is nearly the same for the various treatments and $\sum x^2$ is
nearly the same. Any one plan in which this occurs can be regarded as a
systematic design, of course. Indeed, any plan produced by randomizations looks
to be systematic if one looks at it long enough.
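
As a rough illustration of picking out such plans, the sketch below takes 8 units on a line with 2 treatments, each used 4 times, and keeps the plans whose treatment groups nearly agree in the sum of positions and in the sum of squared positions. The tolerances and the function name are assumptions made only for this example.

```python
from itertools import combinations


def balanced_line_plans(n_units=8, tol_sum=0, tol_sq=8):
    """Equal-replication plans on a line whose position sums nearly agree.

    The position x runs over 1..n_units. A plan is kept when the A and B
    groups differ by at most tol_sum in sum(x) and by at most tol_sq in
    sum(x**2). Both tolerances are illustrative assumptions.
    """
    positions = range(1, n_units + 1)
    kept = []
    for a_pos in combinations(positions, n_units // 2):
        b_pos = [x for x in positions if x not in a_pos]
        d_sum = abs(sum(a_pos) - sum(b_pos))
        d_sq = abs(sum(x * x for x in a_pos) - sum(x * x for x in b_pos))
        if d_sum <= tol_sum and d_sq <= tol_sq:
            kept.append("".join("A" if x in a_pos else "B" for x in positions))
    return kept


frame = balanced_line_plans()
strict_frame = balanced_line_plans(tol_sq=0)
print(frame)
print(strict_frame)
```

With these tolerances the frame contains four plans, among them ABBAABBA; requiring exact agreement of the sums of squares leaves only ABBABAAB and BAABABBA. A frame this small gives a very coarse randomization distribution, which is the price of a severe restriction.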

I use the case of experimentation on a line because the implications are 
obvious. The extension of the basic idea to experimentation on a plane or a set 
of units in $R^k$ is intuitively clear but not easy to implement.

My discussion brings to mind the argumentation in the ’30s and ’40s 
about the value of systematic designs. Fisher (1937) gives a discussion on this 
that is useful but not forcing. He assumed, without even mentioning so, that the 
proper way to analyze a systematic design was by means of the same linear 
model as that he used for a randomized design. He then gave compelling reasons 
in that framework for his view that the systematic Latin square designs were 
variance biased in the sense that the expectation under the null hypothesis of the
treatment mean square would be less than the expectation of the error mean 
square. I say that Fisher’s discussion is not forcing because it is not at all clear 
that the analysis of the data set resulting from any plan should be based on the 
obvious Gauss-Markov-Normal-Linear Model (GMNLN) theory. Considerations
of expectations, variances and covariances under randomization do suggest that
GMNLN theory can be used as approximating randomization distribution theory 
if the classical randomization procedures are followed. 


The Work of R. A. Bailey 


Bailey (1983, 1985) has written very informatively on restricted 
randomization versus blocking and cites much literature that is strongly relevant. 
I suggest that these papers be read. She made (Bailey, 1983, p. 17) critical 
remarks about blocking that are very similar to those I have made in this essay 
and in Kempthorne (1986b, c), where I failed badly in not knowing and 
recognizing her work. 

It appears that, if the plots (or units) lie in a regular configuration
with nice dimensions (e.g., a 2x4 or 8x8 array), one can bring ideas of 
permutation groups to bear. 

I have three comments on this line of work. First, it seems that it is 
only in very special cases that the conditions demanded can be met. What is a 
good thing to do, for instance, with 12 units on a line and 3 treatments? Second, 
the requirement is imposed that the design has to be valid in the sense that the 
analysis of variance based on a linear model gives a treatment mean square and 
error mean square that have equal expectations under the randomization in the 
absence of treatment effects. Third, along with the use of analysis of variance, 
which I have just questioned, there is the problem of how to make tests of 
significance and how to make interval statements about treatment effects. This 
is where Kempthorne came in some decades ago. In his book (Kempthorne, 
1952), he took the viewpoint that GMNLN theory can be used as an 
approximation to randomization theory, with respect to estimation of effects, 
estimation of error and statistical tests (and, hence, intervals on parameters). It 
is rather obvious, I think, that with restricted randomization this will not 
happen. It does not happen with small classical restricted randomized 
experiments; e.g., the 3x3 Latin square design (Kempthorne, 1952, pp. 193-195). 
It is on the basis of such thinking that I advocate the construction of a list of 
acceptable plans and using this list for design and statistical testing. The point is 
that an estimate and standard error of estimate are useless except for the 
construction of a pivotal, and a pivotal with distribution over 2 or 6 points is 
rather useless. 


Some Closing Remarks on D. Basu 


I have found myself in an anomalous position with respect to the 
writings of Basu (to whom I have referred at times as my beloved enemy). It is 
obvious that Basu is highly expert in mathematical statistics at an advanced
measure theory level. I can only admire this aspect. It is also obvious that Basu 
is deeply interested in inference. I have found that I agree rather strongly with 
some of his criticisms of NPW theory. However, I judge that Basu is a sort of 
Bayesian, and it is clear from the present essay, I imagine, that I am strongly 
averse to Bayesian writing that I have seen. 

I am particularly averse to the introduction of formal Bayesian processes 
in the design and analysis of the comparative intervention experiment. I would 
like to read an account by a dedicated Bayesian of a real experimental situation, 
with the real outcome and with the statement of conclusions. In the absence of 
such, I suggest that Bayesian writings be ignored. 

I am not at all clear on whether Basu has written on the problems I 
discuss. I hope that I have not committed any injustices. 

I attempted (Kempthorne, 1980) to give my reactions to Basu’s writing 
on the Fisher randomization test (Basu, 1980) and decided that repetition of this
would serve no useful purpose. The aspect that I did not emphasize then is the 
matter of design. The obviously Bayesian nature of design surely needs 
consideration. The discussion or argumentation of 1980 had little, if any, 
relevance to the problems of experimental method. 
Introduction


A statistic is ancillary if its distribution does not depend on the 
parameters of the model. It might appear at first sight as if ancillary statistics 
could make no contribution to inference about these parameters. However, as 
was pointed out by Fisher who first defined and named the concept (1925, 1934, 
1935, 1936), this appearance is deceptive. By themselves ancillaries of course 
carry no information about the parameters, but they may be very useful in 
conjunction with other parts of the data. 

Ancillarity has connections with many other statistical concepts, among
them sufficiency, group families, conditionality, completeness, information,
pre-randomization, and mixtures. Its most important impact on statistical
methodology comes from the suggestion that inference should be carried out
conditionally given an ancillary statistic rather than unconditionally. For small
samples, the resulting conditional procedures can be less efficient than their
unconditional counterparts; however, they have the advantage of greater
relevance to the situation at hand and frequently are simpler. Typically, the
efficiency difference tends to disappear as the sample size becomes large (see for
example Barndorff-Nielsen, 1983, and Liang, 1984).

Since ancillaries typically are not unique, the recommendation to 
condition on an ancillary is not sufficiently specific. Conditioning comes closest 
to its purpose of making the inference relevant to the situation at hand if the 
ancillary is maximal, i.e. if there exists no other (nonequivalent) ancillary of 
which it is a function. The concept of maximal ancillary, which is basic to the 
theories of ancillarity and conditioning, was introduced by Basu (1959) who 
showed that maximal ancillaries always exist,² but noted that even they may not
be unique. In the same paper he also pointed out some measure theoretic
complications which require the slightly weaker definition of essential maximality for
their resolution. Further results and some basic examples were given in Basu
(1964) and some additional generalizations in Basu (1967). 


¹Research partially supported by NSF grant DMF-8908670.


²For a more precise statement see Theorem 3.
Ancillarity is in a certain sense the dual of sufficiency. If T is a sufficient
statistic, then any inference can be based solely on T, and the conditional
distribution of the full data set X given T is independent of the parameters.
Conversely, if V is ancillary, inference may be based entirely on the conditional
distribution of X given V, while the distribution of V is independent of the
parameters. In this duality, a maximal ancillary corresponds to a minimal
sufficient statistic. They differ however in that a minimal sufficient statistic is
essentially unique and that explicit methods for its construction are available, neither
of which is the case for maximal ancillaries.

Systems including sufficient and ancillary statistics as special cases are 
discussed in Basu (1967). Another common generalization of both sufficiency and
ancillarity is provided by the corresponding concepts (partial sufficiency and partial
ancillarity) in the presence of nuisance parameters. Discussions of these concepts can
be found, for example, in Dawid (1975), Basu (1977), and Barndorff-Nielsen 
(1978). 

General discussions of various aspects of ancillarity are given by Cox and 
Hinkley (1974), Hinkley (1980b), Buehler (1982), Kalbfleisch (1982), and 
Lehmann (1986). A recent important development is the extension to asymptotic 
ancillarity, i.e. statistics with limit distribution independent of the parameters, 
and from that to higher order and local ancillaries. In the present paper, we shall 
restrict attention to exact ancillaries with respect to all unknown parameters, i.e. 
in the original sense considered by Fisher and Basu. However, work on both
partial and approximate ancillaries is included in the references. 


Relation to Other Concepts 
1. Group families 


A group family or transformation model is obtained by subjecting a
random variable with a fixed distribution to a group $G$ of transformations. Any
statistic $V(X)$ that is invariant under $G$ is ancillary. Thus in particular a
maximal invariant with respect to $G$ is ancillary.


Example 1. Location family. 


Let $X = (X_1, \ldots, X_n)$ be distributed according to a location family with
density


$f(x_1 - \theta) \cdots f(x_n - \theta)$.


This is a group family obtained by subjecting a random variable $X = (X_1, \ldots, X_n)$
with density $f(x_1) \cdots f(x_n)$ to the group of transformations


$X_i' = X_i + c, \quad i = 1, \ldots, n, \quad -\infty < c < \infty.$


A maximal invariant is the set of differences


$(X_1 - X_n, \ldots, X_{n-1} - X_n).$


This is the example with which Fisher introduced the concept of ancillarity.
For some general results for the case of group families see Barndorff-Nielsen
(1980).
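
The invariance of the differences is easy to check numerically. The following small sketch uses made-up observations and an arbitrary shift; the function name is an illustrative assumption.

```python
def differences(x):
    """Differences (x_1 - x_n, ..., x_{n-1} - x_n) of a sample x_1, ..., x_n."""
    return [xi - x[-1] for xi in x[:-1]]


sample = [2.5, -1.0, 4.0, 0.5]            # made-up observations
shifted = [xi + 8.0 for xi in sample]     # the same sample after adding c = 8 to every X_i

print(differences(sample))
print(differences(shifted))
```

Both calls give the same vector, so whatever the location parameter is, the differences have the distribution they would have under the fixed density, which is why they are ancillary.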


2. Mixture experiments 


Suppose a family of experiments $\mathcal{E}_z$, $z \in \mathcal{Z}$, is available, each experiment
consisting of a family of distributions $\mathcal{P}_z = \{P_{z,\theta}, \theta \in \Omega\}$, labeled by the same
parameter $\theta$, i.e. corresponding to the same states of nature. A value of $z$ is
selected according to a known distribution $\Pi$ and the experiment $\mathcal{E}_z$ is performed,
resulting in the observation of a random quantity $X$ with distribution $P_{z,\theta}$. For
the final result $X$ of such a mixture experiment, $Z$ is ancillary since its
distribution $\Pi$ is known.


Example 2. Two workers. 


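Let


$\mathcal{E}_0 = (X, \mathcal{P}), \quad \mathcal{P} = \{P_\theta, \theta \in \Omega\}$


$\mathcal{E}_1 = (Y, \mathcal{Q}), \quad \mathcal{Q} = \{Q_\theta, \theta \in \Omega\}$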


be two experiments, corresponding for example to two different workers A and B 
performing a needed experimental task. One of the workers is chosen at random 
(with probability 1/2 each) and is assigned to perform the experiment. Here a 
random variable taking on the values of 0 and 1 as worker A or B is chosen plays 
the role of Z. The example, which was first discussed in this context by Cox 
(1958), makes clear the appeal of conditioning on the experiment actually 
performed. 
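
A numerical version of the two-worker situation may make the appeal of conditioning concrete. In the sketch below the two workers are assumed, purely for illustration, to produce unbiased measurements with variances 1 and 100; none of these numbers come from the text.

```python
from fractions import Fraction

# Hypothetical measurement variances for workers A and B (assumptions).
variance = {"A": Fraction(1), "B": Fraction(100)}
# Each worker is chosen with probability 1/2, as in the example.
prob = {"A": Fraction(1, 2), "B": Fraction(1, 2)}


def unconditional_variance():
    """Variance of the reported value averaged over the choice of worker."""
    # Both workers are assumed unbiased, so the variance of the mixture is
    # the probability-weighted average of the two variances.
    return sum(prob[w] * variance[w] for w in variance)


def conditional_variance(worker):
    """Variance given the worker who actually performed the experiment."""
    return variance[worker]


print(unconditional_variance())
print(conditional_variance("A"), conditional_variance("B"))
```

The unconditional figure (101/2 under these assumptions) describes neither worker; once it is known who did the work, the relevant precision is that of the chosen worker.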

Mixture models appear to represent a rather special case of models 
admitting ancillaries but in fact, unlike group families, they cover all cases. To 
see this, suppose that $X$ is distributed according to one of the distributions $P_\theta$,
$\theta \in \Omega$, and that $V$ is ancillary for $X$. For each value $v$, let $\mathcal{E}_v$ be the experiment
consisting in observing a random quantity $X'$, distributed according to the
conditional distribution of $X$ given $v$. Then $X'$ is the outcome of a mixture
experiment and its distribution is the same as that of $X$.

Some authors have introduced distinctions between real and conceptual 
(Basu, 1964) or experimental and mathematical (Kalbfleisch, 1975, 1982) 
ancillaries. However, these distinctions require going outside the postulated 
models and are based on considerations involving other models. 


3. Conditionality; pre-randomization 


Fisher’s suggestion that inference should be conditional on an ancillary is 
called the principle of conditionality. As was discovered by A. Birnbaum (1962), 
conditionality has surprisingly strong consequences for the foundations of
statistics since in conjunction with sufficiency it implies the likelihood principle. 
For discussions of this result and its consequences see Rao (1971), Basu (1975), 
Joshi (1983), Berger and Wolpert (1984), and Evans, Fraser and Monette (1986). 

Typically, conditioning on ancillaries seems reasonable. However, it runs 
into difficulty when the design involves deliberate randomization (e.g. random 
selection of a sample, random assignment of subjects, or random choice of a 
Latin square). Since the random selection process with known probabilities is 
ancillary, the conditionality principle would require conditioning on the selected 
arrangement, thus largely vitiating the purposes of randomization. This 
difficulty is discussed, for example, in Basu (1969, 1978, 1980), Berger and 
Wolpert (1984), and Finch (1986). 


4. Sufficiency 


Sufficient statistics provide data reduction without loss of information. 
The amount of reduction that can be achieved in this way depends on the 
situation. 


Example 1. Location family (continued). 


If the density $f$ in Example 1 is the standard normal density, sufficiency
reduces the full n-dimensional sample $X_1, \ldots, X_n$ to the single statistic
$\bar{X} = \sum X_i / n$, regardless of the size of n. On the other hand, if $f$ is, for example, the
logistic, Cauchy, or double exponential density, the minimal sufficient statistic is
the set of order statistics $X_{(1)} < \cdots < X_{(n)}$, so that there is hardly any
reduction. As discussed in Lehmann (1981), the amount of reduction depends 
essentially on how much of the ancillary information the minimal sufficient 
statistic retains. 


5. Completeness 


The most favorable situation for reduction by means of a sufficient 
statistic T is that in which all ancillaries are independent of T. A sufficient 
condition for this to occur is given by the following result which (together with a 
converse) is known as Basu’s theorem (Basu, 1955, 1958, 1982 and Koehn and 
Thomas, 1975). 


Theorem 1. (Basu). 


If T is boundedly complete, then every ancillary is independent of T. 

That bounded completeness is not necessary for every ancillary to be 
independent of T can be seen for instance from examples in which the constants 
are the only ancillaries. A condition that is necessary, but not sufficient, is
provided by the concept of weak completeness, introduced by Basu and Ghosh
(1968), and independently in the present context by Lehmann (1981) under the 
term $\mathcal{F}_0$-completeness.
Definition 1.


A statistic $T$ is weakly complete with respect to a family
$\mathcal{P}^T = \{P_\theta^T, \theta \in \Omega\}$ of distributions of $T$ if


$E_\theta f(T) = 0$ for all $\theta \in \Omega \;\Rightarrow\; f(t) = 0$ (a.e. $\mathcal{P}^T$)


for all two-valued functions $f$.

As we shall see later, this concept is central to the study of maximal 
ancillaries. 

Note. A (not very useful) completeness condition that is both necessary
and sufficient for every ancillary to be independent of T is given by Lehmann
(1981).


6. Conditionality and sufficiency in conflict 


The principles of conditionality and sufficiency may conflict, as in the 
following example of Becker and Gordon (1983), which is essentially equivalent 
to one considered in a different context by Fisher (1956, p. 47).


Example 3. Quadrinomial. 


Consider n quadrinomial trials with the probabilities of the four
outcomes being


$p_1 = \frac{1+\theta}{5}, \quad p_2 = \frac{1-\theta}{5}, \quad p_3 = \frac{1-\theta}{5}, \quad p_4 = \frac{2+\theta}{5}, \quad -1 < \theta < 1,$


and with $N_1, \ldots, N_4$ denoting the numbers of the trials resulting in these outcomes.
Then $T = (N_1, N_2+N_3, N_4)$ is minimal sufficient and it appears that there are no
ancillaries based on $T$. On the other hand, $A = (N_1+N_2, N_3+N_4)$ is clearly
ancillary, and so is $B = (N_1+N_3, N_2+N_4)$.

It seems clear to the present authors that here sufficiency should be given 
priority over ancillarity, and inference should be based on T. For otherwise, 
given a trinomial situation with probabilities $((1 + \theta)/5, (2 - 2\theta)/5, (2 + \theta)/5)$
(the distribution of $T$), we would prefer a procedure that would require dividing
the trials in the middle category, each with probability 1/2, between two artificial
subcategories. This seems very unappealing. 
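
The claims in this example can be checked by exact arithmetic. The sketch below reads the four cell probabilities as $(1+\theta)/5$, $(1-\theta)/5$, $(1-\theta)/5$, $(2+\theta)/5$, a reading that agrees with the trinomial distribution of $T$ quoted above; the function name and the particular values of $\theta$ tried are illustrative assumptions.

```python
from fractions import Fraction


def cell_probs(theta):
    """Cell probabilities (p1, p2, p3, p4) of the quadrinomial, as read above."""
    theta = Fraction(theta)
    return [(1 + theta) / 5, (1 - theta) / 5, (1 - theta) / 5, (2 + theta) / 5]


for theta in (Fraction(-1, 2), Fraction(0), Fraction(3, 4)):
    p1, p2, p3, p4 = cell_probs(theta)
    assert p1 + p2 + p3 + p4 == 1
    # A = (N1+N2, N3+N4): a trial falls in the first group with probability p1 + p2.
    assert p1 + p2 == Fraction(2, 5)
    # B = (N1+N3, N2+N4): a trial falls in the first group with probability p1 + p3.
    assert p1 + p3 == Fraction(2, 5)
    # p2 equals p3, so the likelihood involves N2 and N3 only through N2 + N3.
    assert p2 == p3

print("A and B each have a binomial(n, 2/5) first component for every theta tried")
```

Since the first component of $A$ (and of $B$) is binomial with success probability 2/5 whatever $\theta$ is, both are ancillary, while neither is a function of $T$.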


7. Similar regions and regions of Neyman structure 


A set $S$ in the sample space is a similar region with respect to a family
$\mathcal{P} = \{P_\theta, \theta \in \Omega\}$ if $P_\theta(X \in S)$ does not depend on $\theta$, i.e. if its indicator is
ancillary. The set $S$ is said to have Neyman structure with respect to a sufficient
statistic $T$ if the conditional probability
$P(X \in S \mid t)$ is independent of $t$ a.e.


Suppose now that T is boundedly complete. Then by Theorem 1 every 
ancillary, and therefore the indicator $I_S$ of any similar region, is independent
of T and therefore has Neyman structure. The characterization of all similar
regions as having Neyman structure in the presence of a complete sufficient statistic
is therefore mathematically (although not in its interpretation) equivalent to 
Theorem 1. 


8. Information 


Fisher’s primary interest in introducing ancillary statistics was the
recovery of information. If $I_X(\theta)$ and $I_{\hat\theta}(\theta)$ denote the amount of Fisher
information in the sample $X$ and the maximum likelihood estimator $\hat\theta$
respectively, then it will often happen that¹ $I_{\hat\theta}(\theta) < I_X(\theta)$, so that $\hat\theta$ is not fully
informative. Fisher discovered that the lost information can be recovered if there
exists an ancillary statistic $V$ such that $(\hat\theta, V)$ is sufficient, in the following sense.
If $I_{\hat\theta \mid v}(\theta)$ is the information carried by $\hat\theta$ in the conditional distribution given
$V = v$, then


$E\, I_{\hat\theta \mid V}(\theta) = I_X(\theta)$. (1)


For a discussion of the implementation of this program in two important classes
of models, see Barndorff-Nielsen (1980). When (1) holds, the average conditional
information equals the whole information in the sample; for particular values of
$v$, the conditional information of $\hat\theta$ given $v$ may be smaller or larger than $I_X(\theta)$.

Recall now the other motive for conditioning on ancillaries: to make the 
inference more relevant to the situation at hand. Cox (1971) points out that 
ancillaries are therefore most useful when the amount $I_v(\theta)$ of information in the
conditional distribution of X given v varies widely with v, so that some values of 
v are much more informative than others. This point is nicely illustrated by 
Example 2, where conditioning on the chosen worker seems particularly 
important when there is a big difference in the quality of their work. 

In the light of this remark, Cox suggests that when the maximal 
ancillary is not unique, that ancillary should be preferred for which $I_V(\theta)$ is most
variable, e.g. for which the variance $\mathrm{var}[I_V(\theta)]$ is the largest.


Weak Completeness 


The central concept for the characterization of maximal ancillaries is 
weak completeness. It is easy to see that the definition of weak completeness 
given in the preceding section is equivalent to the following statement. 


¹We have here assumed for the sake of simplicity that $\theta$ is real valued.
The family $\mathcal{P} = \{P_\theta, \theta \in \Omega\}$ is weakly complete if any measurable
set $A$ with probability independent of $\theta$ has probability 0 or 1. (2)


This is the form in which the definition was given by Basu and Ghosh (1969). A 
simple restatement of (2) yields Theorem 2. 


Theorem 2. 


A family P admits no nontrivial ancillaries (i.e. any ancillary statistic is 
almost surely constant) if and only if P is weakly complete. 

To illustrate the situation of no ancillaries consider the following 
examples. 


Example 3. No ancillaries. 
Let $X_i$ be independent $N(\theta_i, 1)$, $i = 1, \ldots, n$. Then $X = (X_1, \ldots, X_n)$ is
complete, hence weakly complete, and so there are no ancillaries.


Example 4. Sequential binomial sampling. 


Consider a sequence of binomial trials, with success probability p and a 
stopping rule (with probability 1 of eventually stopping). This can be 
represented by a random walk in the plane starting at the origin, with a unit step 
to the right for a success and a unit step up for a failure. The stopping rule is 
represented by a set of stopping points. The observation is a path starting at 
(0, 0) and ending at some stopping point $(a, b)$. Since every path ending at $(a, b)$
has probability $p^a (1 - p)^b$, it follows that the coordinates $(a, b)$ of the stopping
point constitute a sufficient statistic, which may or may not be complete 
(necessary and sufficient conditions for completeness are given in Lehmann and 
Stein, 1950). The path itself is of course not complete except in the rare cases in 
which there is only one path to each stopping point. 

(i) In light of this it is very surprising that not only the endpoint but
also the path itself is weakly complete, provided the stopping rule has a finite
boundary point on the x or the y axis. To see this let $S$ be a set of paths with
$P_p(S) = c$ $\forall p \in (0, 1)$. Suppose the stopping rule has a finite boundary point
$(0, k)$ for some $k \geq 1$. Then the path $\pi_0$ from (0, 0) to $(0, k)$ is either contained
in $S$ or in its complementary set of paths $S^c$. It follows that either
$c = P_p(S) \geq P_p(\pi_0) = (1 - p)^k$ or $1 - c = P_p(S^c) \geq P_p(\pi_0) = (1 - p)^k \to 1$ as $p \to 0$, so
that either $c = 1$ or $c = 0$. The case of a finite boundary point $(k, 0)$ is treated
similarly. Hence there are no ancillaries.

(ii) If there is no bound on the stopping rule along the x- or y-axis, then
weak completeness may not obtain as the following example shows. Perform the 
binomial trials in pairs until the first time that either (success, failure) or 
(failure, success) is observed. Then the set S of paths that end in (failure, 
success) has probability 1/2 for all p € (0, 1). 
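
The probability 1/2 in (ii) follows from summing a geometric series over the number of uninformative pairs, and this can be confirmed with exact arithmetic. The sketch below is only a check of that calculation; the function name and the grid of values of p are illustrative assumptions.

```python
from fractions import Fraction


def prob_end_failure_success(p):
    """P(sampling stops with the pair (failure, success)) for success probability p."""
    p = Fraction(p)
    q = 1 - p
    continue_prob = p * p + q * q   # pair is (S, S) or (F, F): sampling continues
    stop_fs = q * p                 # pair is (failure, success): sampling stops
    # Geometric series: sum over k >= 0 of continue_prob**k * stop_fs.
    return stop_fs / (1 - continue_prob)


values = [prob_end_failure_success(Fraction(k, 10)) for k in range(1, 10)]
print(values)
```

Since $1 - p^2 - q^2 = 2pq$, the ratio is 1/2 for every p in (0, 1), so the indicator of this set of paths is a nontrivial ancillary.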
Note. Exactly the same result as in Example 4 with the same proof
applies to sequential sampling from trinomial (or any multinomial) trials. 

The following example is due to Basu and Ghosh (1969) where many 
additional examples can be found. 


Example 5. Two-point location families.
Let $X$ take on the two values $\theta$ and $\theta + c$ with probabilities


$P(X = \theta) = \pi$,
$P(X = \theta + c) = 1 - \pi$, $-\infty < \theta < \infty$,


$\pi$ and $c$ known. Then $X$ is weakly complete provided $\pi \neq 1/2$, but not when
$\pi = 1/2$. In the latter case any set $A$ whose complement is $A + c$ has probability
1/2, independent of $\theta$.
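
One concrete set of this kind is the union of the intervals $[2jc, (2j+1)c)$ over all integers $j$, whose complement is the same set shifted by $c$. The sketch below, with illustrative names and made-up parameter values, computes the probability of this set exactly for several values of the location parameter.

```python
import math
from fractions import Fraction


def in_A(x, c):
    """Membership in A = union over integers j of [2jc, (2j+1)c)."""
    return math.floor(x / c) % 2 == 0


def prob_A(theta, c, pi):
    """P(X in A) when X = theta with probability pi and theta + c with probability 1 - pi."""
    return pi * int(in_A(theta, c)) + (1 - pi) * int(in_A(theta + c, c))


c = Fraction(1)
thetas = [Fraction(-7, 3), Fraction(0), Fraction(1, 4), Fraction(5, 2)]  # made-up values

half = {prob_A(t, c, Fraction(1, 2)) for t in thetas}
third = {prob_A(t, c, Fraction(1, 3)) for t in thetas}
print(half, third)
```

Exactly one of the two support points lies in the set, so its probability is either the first or the second of the two point probabilities; these coincide, and the probability is free of the location parameter, only when both equal 1/2.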

It turns out that Theorem 2 is a special case of a general characterization 
of maximality for an ancillary statistic V, given in its proper setting in Theorem 
4. Loosely, this characterization finds V to be maximal if and only if the family 
of conditional distributions of X given V is weakly complete. In the situation of 
Theorem 2, where V is constant, this family of conditional distributions coincides 
with the family P of distributions for X. 

In the case when the only ancillary statistics are the a.s. constant 
functions there (usually) does not exist a maximal ancillary (due to null set 
problems) but a maximal ancillary σ-field $\mathcal{A}_M$ does exist, see Theorem 2. The 
reason is that not every σ-field is induced by a statistic. Since the σ-field induced 
by an a.s. constant function is essentially equivalent to $\mathcal{A}_M$ (to be made precise 
below) it makes sense to call such an a.s. constant function essentially maximal 
ancillary; the alternative would be to admit that there are no maximal ancillary 
statistics due to null set problems. This state of affairs carries over to the general 
case and the above loosely stated characterization is that of essential maximal 
ancillarity. Bearing this in mind one may want to accept that characterization 
and skip or skim the next two sections. 


Notation and Definitions 


Let $(\mathcal{X}, \mathcal{B})$ be an arbitrary measurable space and $\{P_\theta, \theta \in \Omega\}$ be a 
family of probability measures on $\mathcal{B}$. Considering $\mathcal{X}$ as the sample space we 
denote the random element in $\mathcal{X}$ by $X$ and write $P_\theta(X \in B) = P_\theta(B)$ $\forall B \in \mathcal{B}$. 
We now give some definitions and a theorem taken from Basu (1959). 


Definition 2. 


A σ-field $\mathcal{A} \subset \mathcal{B}$ is said to be ancillary if $P_\theta(A)$ is constant in $\theta \in \Omega$ 
$\forall A \in \mathcal{A}$. 

Comment. One easily sees that $\mathcal{A}$ is ancillary iff $\int f(x)\,dP_\theta(x)$ is constant 
in $\theta \in \Omega$ for all integrable and $\mathcal{A}$-measurable functions $f: \mathcal{X} \to R$. 


Definition 3. 


If $V: (\mathcal{X}, \mathcal{B}) \to (\mathcal{Y}, \mathcal{C})$ is a statistic ($\mathcal{A}_V := V^{-1}(\mathcal{C}) \subset \mathcal{B}$) then $V$ is said 
to be ancillary if $\mathcal{A}_V$ is ancillary. 

Comment. Rather than dealing with (ancillary) statistics we follow 
Basu’s example and continue the following theoretical exposition in terms of 
(ancillary) σ-fields. When dealing with concrete examples we will use the more 
intuitive term statistic in place of σ-field. Hence it is understood that the 
following definitions in terms of σ-fields have analogous counterparts in terms of 
statistics. 


Definition 4. 


An ancillary σ-field $\mathcal{A} \subset \mathcal{B}$ is said to be maximal ancillary if there 
exists no other ancillary σ-field $\mathcal{A}^* \subset \mathcal{B}$ such that $\mathcal{A} \subset \mathcal{A}^*$. 


Theorem 3. (Basu, 1959). 


Given an ancillary σ-field $\mathcal{A} \subset \mathcal{B}$ there exists a maximal ancillary 
σ-field $\mathcal{A}_M \subset \mathcal{B}$ such that $\mathcal{A} \subset \mathcal{A}_M$. 


Definition 5. 


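Two σ-fields $\mathcal{A}_1, \mathcal{A}_2 \subset \mathcal{B}$ are said to be essentially equivalent if for any 
$A_1 \in \mathcal{A}_1$ ($A_2 \in \mathcal{A}_2$) there exists an $A_2 \in \mathcal{A}_2$ ($A_1 \in \mathcal{A}_1$) such that 


$P_\theta(A_1 \Delta A_2) = 0$ $\forall \theta \in \Omega$. 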


Definition 6. 


Any ancillary σ-field that is essentially equivalent to a maximal ancillary 
σ-field is called essentially maximal ancillary. 

Comment. Although Theorem 3 guarantees the existence of a maximal 
ancillary σ-field $\mathcal{A}_M$ containing any given ancillary σ-field $\mathcal{A}$, the same does not 
necessarily hold for statistics. The reason is that $\mathcal{A}_M$ is usually too rich to be 
generated by any statistic $V$. 

The following definition of conditional weak completeness is a direct 
adaptation of the concept of weak completeness to the conditioned case. 


Definition 7. 


$X$ given $\mathcal{A}$ is said to be conditionally weakly complete if for any given 
function 


$g(x) = a(x) I_B(x) + b(x) I_{B^c}(x)$ 

with $B \in \mathcal{B}$, $a(\cdot)$ and $b(\cdot)$ $\mathcal{A}$-measurable and such that 

$\forall \theta \in \Omega$: $E_\theta(g(X) \mid \mathcal{A}) = 0$ a.s. $(P_\theta)$ 


we have 


$\forall \theta \in \Omega$: $P_\theta(g(X) = 0 \mid \mathcal{A}) = 1$ a.s. $(P_\theta)$, 


i.e. $P_\theta(g(X) = 0) = 1$ $\forall \theta \in \Omega$. 
An equivalent formulation of Definition 7, without the a.s. qualifiers, is 
Definition 7’. 


Definition 7’. 


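$X$ given $\mathcal{A}$ is said to be conditionally weakly complete if for any given 
function 
$g(x) = a(x) I_B(x) + b(x) I_{B^c}(x)$ 
with $B \in \mathcal{B}$, $a(\cdot)$ and $b(\cdot)$ $\mathcal{A}$-measurable and such that 
$E_\theta(g(X) I_A(X)) = 0$ $\forall \theta \in \Omega$ and $\forall A \in \mathcal{A}$ 


we have $P_\theta(g(X) = 0) = 1$ $\forall \theta \in \Omega$. 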

Note that Definitions 7 and 7’ are not contingent on the existence of 
regular conditional distributions. However, if X admits regular conditional 
distributions given A a natural question is: how does weak completeness of a 
family of regular conditional probability distributions relate to the conditional 
weak completeness of X given A defined above? Lemma 1 will provide a partial 
answer under certain regularity conditions. These conditions are as follows: 


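i) $\Omega$ is a separable topological space, 
ii) $\mathcal{A}$ is generated by the ancillary statistic $V: (\mathcal{X}, \mathcal{B}) \to (\mathcal{Y}, \mathcal{C})$, 
iii) $\forall v \in \mathcal{Y}$: $\{f_\theta(\cdot \mid v), \theta \in \Omega\}$ is a family of conditional densities for 
$X$ given $V = v$ with respect to a σ-finite dominating measure $\mu$ on 
$(\mathcal{X}, \mathcal{B})$, 
iv) $\exists N \in \mathcal{C}$ with $P(V \in N) = 0$ so that $\forall v \in N^c$ we have $f_\theta(x \mid v) \to 
f_{\theta_0}(x \mid v)$ a.s. in $x$ $[\mu]$ whenever $\theta \to \theta_0$. 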


Lemma 1. 


Under conditions i) - iv) the weak completeness of the families $\{f_\theta(\cdot \mid v), 
\theta \in \Omega\}$ $\forall v \in N_1^c$ with $P(V \in N_1) = 0$ implies the conditional weak completeness 
(Definition 7) of $X$ given $\mathcal{A}$. 

Proof: Let $g$ be as in Definition 7, then for any $\theta \in \Omega$ we have 


$0 = E_\theta(g(X) \mid V) = \int g(x) f_\theta(x \mid V)\,d\mu(x)$ a.s. $P$. (3) 


Since the exceptional null set may depend on $\theta$ (through $f_\theta$) we invoke (3) for all 
$\theta$ in a countable dense subset of $\Omega$. Using Scheffé’s theorem in conjunction with 
iv) it follows that there exists a set $N_0 \in \mathcal{C}$ with $P(V \in N_0) = 0$ such that for 
$v \in N_0^c$ we have 


$0 = \int g(x) f_\theta(x \mid v)\,d\mu(x)$ $\forall \theta \in \Omega$ 


which by weak completeness of the conditional densities entails for all $v \in N_0^c$ 


$\int I_{\{g(x) = 0\}} f_\theta(x \mid v)\,d\mu(x) = 1$ $\forall \theta \in \Omega$, 


hence $P_\theta(g(X) = 0) = 1$ $\forall \theta \in \Omega$. 
It is not clear whether the converse of Lemma 1 is true under the stated 
conditions. 


Characterization of Maximal Ancillarity 


The following theorem will give necessary and sufficient conditions for an 
ancillary σ-field $\mathcal{A} \subset \mathcal{B}$ to be essentially maximal ancillary. A special case of 
Theorem 4 was proved by Basu and Ghosh (1969) for the case of a dominated 
location family. 


Theorem 4. 


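If $\mathcal{A} \subset \mathcal{B}$ is ancillary, then the following statements are equivalent: 
i) $\mathcal{A}$ is essentially maximal ancillary. 


ii) $\nexists B \in \mathcal{B}$ such that $P_\theta(B \mid \mathcal{A})$ admits a version $\psi_B$ ($\mathcal{A}$-measurable) 
independent of $\theta \in \Omega$ with $P(0 < \psi_B(X) < 1) > 0$. 


iii) $X$ given $\mathcal{A}$ is conditionally weakly complete. 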
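Proof. i) ⇒ ii): Let $\mathcal{A}$ be ancillary and let $B \in \mathcal{B}$ be such that $P_\theta(B \mid \mathcal{A})$ 
admits a version $\psi_B$ ($\mathcal{A}$-measurable) independent of $\theta \in \Omega$. First note that the 
smallest σ-field $\mathcal{A}_B$ containing both $\mathcal{A}$ and $B$ is ancillary, since 


$P_\theta(A \cap B) = \int_A \psi_B(x)\,dP_\theta(x)$, $A \in \mathcal{A}$ 


is independent of $\theta \in \Omega$ ($\mathcal{A}$ is ancillary and $\psi_B$ is $\mathcal{A}$-measurable and independent 
of $\theta \in \Omega$) and since this property extends to all of $\mathcal{A}_B$ by the usual unique 
measure extension. 

Next let $A_0 = \{x \in \mathcal{X}: 0 < \psi_B(x) < 1\} \in \mathcal{A}$. Assuming $\mathcal{A}$ to be 
essentially maximal ancillary we can find $A_1 \in \mathcal{A}$ such that $N = A_1 \Delta (A_0 \cap B) 
\in \mathcal{A}_B$ has probability zero for all $\theta \in \Omega$. Then 


$I_{A_1}(x) = I_{A_0 \cap B}(x)$ $\forall x \in N^c$ 
and taking conditional expectation given $\mathcal{A}$ we have 

$I_{A_1}(X) = I_{A_0}(X)\,\psi_B(X)$ a.s. $P_\theta$. 

The left side takes only the values 0 and 1, while on $A_0$ the right side lies strictly 
between 0 and 1, which implies $P(0 < \psi_B < 1) = 0$, thus i) ⇒ ii). 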
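ii) ⇒ iii): Let 


$g(x) = a(x) I_B(x) + b(x) I_{B^c}(x)$ 
$= (a(x) - b(x)) I_B(x) + b(x)$ 
with $B \in \mathcal{B}$, $a(\cdot)$ and $b(\cdot)$ $\mathcal{A}$-measurable such that 
$\forall \theta \in \Omega$: $E_\theta(g(X) \mid \mathcal{A}) = 0$ a.s. $P_\theta$. (4) 
Let $C_0 = \{x \in \mathcal{X}: a(x) \ne b(x)\}$, $B_0 = C_0 \cap B$ and 
$\psi_{B_0}(x) = b(x)/(b(x) - a(x))$ for $x \in C_0$, 
$\psi_{B_0}(x) = 0$ for $x \in C_0^c$. 


The condition (4) on $g$ implies that $\psi_{B_0}$ may serve as a $\theta$-independent 
version of $P_\theta(B_0 \mid \mathcal{A})$ $\forall \theta \in \Omega$, since 


$0 = E_\theta(I_{C_0}(X) g(X) \mid \mathcal{A})$ 
$= (a(X) - b(X)) P_\theta(B_0 \mid \mathcal{A}) + b(X) I_{C_0}(X)$ a.s. $P_\theta$, 


i.e. 


$X \in C_0 \Rightarrow P_\theta(B_0 \mid \mathcal{A}) = b(X)/(b(X) - a(X)) = \psi_{B_0}(X)$ a.s. $P_\theta$ 


and for $X \in C_0^c \Rightarrow P_\theta(B_0 \mid \mathcal{A}) = 0 = \psi_{B_0}(X)$ a.s. $P_\theta$. 
Condition (4) also implies 


$0 = E_\theta(g(X) I_{C_0^c}(X) \mid \mathcal{A}) = b(X) I_{C_0^c}(X) = g(X) I_{C_0^c}(X)$ a.s. $P_\theta$ $\forall \theta \in \Omega$. (5) 

Since $P(0 < \psi_{B_0} < 1) = 0$ by ii) 
$\Rightarrow P(\psi_{B_0} \in \{0, 1\}) = 1$ 
$\Rightarrow I_{B_0}(X) = \psi_{B_0}(X)$ a.s. $P_\theta$ $\forall \theta \in \Omega$ 
$\Rightarrow g(X) I_{C_0}(X) = 0$ a.s. $P_\theta$ $\forall \theta \in \Omega$ 


since 


$0 = E_\theta(g(X) I_{C_0}(X) \mid \mathcal{A})$ 

$= (a(X) - b(X)) \psi_{B_0}(X) + b(X) I_{C_0}(X)$ 

$= (a(X) - b(X)) I_{B_0}(X) + b(X) I_{C_0}(X)$ 

$= I_{C_0}(X) g(X)$ a.s. $P_\theta$ $\forall \theta \in \Omega$. 
This together with (5) implies $P_\theta(g(X) = 0) = 1$ $\forall \theta \in \Omega$, i.e. ii) ⇒ iii). 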

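iii) ⇒ i): By Theorem 3 there exists a maximal ancillary σ-field $\mathcal{A}_M \supset 
\mathcal{A}$. Let $D_0 \in \mathcal{A}_M$ and for some fixed $\theta_0 \in \Omega$ and some version of $P_{\theta_0}(D_0 \mid \mathcal{A})$ let 
$\psi_{D_0}(x) = P_{\theta_0}(D_0 \mid \mathcal{A})$, then for $A \in \mathcal{A}$: 
$\int_A P_\theta(D_0 \mid \mathcal{A})\,dP_\theta = P_\theta(A \cap D_0) = P_{\theta_0}(A \cap D_0)$ 


$= \int_A \psi_{D_0}(x)\,dP_{\theta_0}(x) = \int_A \psi_{D_0}(x)\,dP_\theta(x)$, 


i.e. $\psi_{D_0}$ may serve as a $\theta$-independent version of $P_\theta(D_0 \mid \mathcal{A})$ $\forall \theta \in \Omega$. Let 


$g(x) = I_{D_0}(x) - \psi_{D_0}(x)$ 


then $\forall \theta \in \Omega$: $E_\theta(g(X) \mid \mathcal{A}) = 0$ a.s. $P_\theta$, which under iii) implies $P_\theta(g(X) = 0) = 
1$ $\forall \theta \in \Omega$, i.e. 


$\psi_{D_0}(X) = I_{D_0}(X)$ a.s. $P_\theta$ $\forall \theta \in \Omega$ 
which shows $\mathcal{A}$ and $\mathcal{A}_M$ to be essentially equivalent, i.e. iii) ⇒ i) q.e.d. 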


Examples 


In the examples that follow it is understood that when claiming maximal 
ancillarity what is really meant is essential maximal ancillarity. However, these 
two concepts coincide when the null set issues do not arise, as in situations when 
the ancillary is discrete. 


Example 6. 


With probability 1/2 let $X_1, \ldots, X_n$ be i.i.d. from $N(\theta, 1)$, and with 
probability 1/2 from $N(\theta, 2)$. Let $I = 1$ or 0 as the first or the second case 
obtains. Then $V = (I, X_1 - X_n, \ldots, X_{n-1} - X_n)$ is maximal ancillary since $(I, 
X_1, \ldots, X_n)$ is equivalent to $(\bar{X}, V)$ and the conditional distribution of $\bar{X}$ given $V$ is 
complete. 


Example 7. 


Let $X_1, \ldots, X_n$ be i.i.d. with continuous and strictly increasing c.d.f. $F$. 
This model is invariant under the group $G$ of common, continuous, strictly 
increasing transformations $X_i' = g(X_i)$, $i = 1, \ldots, n$. A maximal invariant is the 
vector of ranks $(R_1, \ldots, R_n)$ of the $n$ $X$'s. Since the group $G$ is transitive, the 
maximal invariant is ancillary. Is it maximal ancillary? The conditional 
distribution of the $X$’s given the ranks is the same as the joint distribution of the 
rank permuted order statistics, and the distribution of the latter is complete, 
hence weakly complete. It follows that the ranks are maximal ancillary. 


Example 8. 


In Example 7, suppose attention is restricted to $F$ with median 0. Now 
the ranks are no longer maximal ancillary since the ranks together with the 
number of positive observations are ancillary. This latter ancillary is maximal 
since the order statistics given the number of positive and negative observations 
are complete. (We are dealing with $n_+$ and $n_-$ functions on $(0, \infty)$ and $(-\infty, 0)$, 
respectively. Note: This maximal ancillary is a maximal invariant under a 
smaller group than in Example 7, namely the group $G$ of transformations $g$ 
which are continuous, strictly increasing and satisfy $g(0) = 0$.) 


Example 9. 


Let $X_1, \ldots, X_n$ be i.i.d. $N(\theta, 1)$. Here of course the vector of differences 
$(X_1 - X_n, \ldots, X_{n-1} - X_n)$ is maximal ancillary since the distribution of $\bar{X}$ is complete. 

As has been pointed out by Basu (1959) and others, maximal ancillarity 
does not mean that there are no other maximal ancillaries. As a well known 
example, in the present case with $n = 2$, we have that $V = (X_1 - X_2)\,\mathrm{sign}(\bar{X})$ is 
ancillary. To see that it is also maximal note that $(X_1, X_2)$ is equivalent to 
$(\bar{X}, V)$ and that $\bar{X}$ and $V$ are independent. Now the completeness of $\bar{X}$ entails 
the conditional weak completeness of $(\bar{X}, V)$ given $V$. 
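
The ancillarity of the statistic $(X_1 - X_2)\,\mathrm{sign}(\bar{X})$ can be probed by simulation. The following sketch is a Monte Carlo check under assumed sample sizes and seeds, not a proof; the function names are hypothetical. Since $X_1 - X_2$ is symmetric about 0 and independent of $\bar{X}$, the statistic should be $N(0, 2)$ for every location value, so the fraction of positive values should be near 1/2 and the second moment near 2.

```python
import random

def simulate_V(theta, n_draws, seed):
    """Draw V = (X1 - X2) * sign(Xbar) for X1, X2 i.i.d. N(theta, 1)."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_draws):
        x1 = rng.gauss(theta, 1.0)
        x2 = rng.gauss(theta, 1.0)
        xbar = 0.5 * (x1 + x2)
        sign = 1.0 if xbar > 0 else -1.0
        values.append((x1 - x2) * sign)
    return values

def summary(values):
    """Fraction of positive values and empirical second moment."""
    n = len(values)
    frac_positive = sum(v > 0 for v in values) / n
    second_moment = sum(v * v for v in values) / n
    return frac_positive, second_moment

# Two quite different location values give matching summaries.
print(summary(simulate_V(0.0, 40_000, seed=1)))
print(summary(simulate_V(3.0, 40_000, seed=2)))
```

Agreement of two summaries is of course only a necessary condition for ancillarity; the independence argument in the text is what establishes it.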

Another maximal ancillary is $V' = X_1 - X_2$. Which of these two 
ancillaries is preferable? The Cox criterion discussed in part 4 of the section 
“Relations to Other Concepts” does not distinguish between them; however, a 
criterion advanced by Barnard and Sprott (1971) applies and gives preference to 
$X_1 - X_2$ since it is invariant under translations (see Padmanabhan, 1977). 


46 E. L. Lehmann & F.W. Scholz 


Example 10. 


Let $X_1, \ldots, X_n$ be i.i.d. uniform on $[\theta, \theta + 1)$. Here (denoting by $[x]$ the 
integer part of $x$) $(X_1 - X_n, \ldots, X_{n-1} - X_n)$ together with $X_n - [X_n]$ are ancillary and 
are easily seen to be maximal ancillary since the conditional distribution of $[X_n]$ 
(all that is left of the data for any fixed $\theta$) given that ancillary is just a one point 
distribution which is complete. 

Basu (1964) treats this example in the case $n = 1$; Basu and Ghosh 
(1969) treat the same example for the case of arbitrary $n$ for which they 
determine the maximal ancillary σ-field. 

Basu and Ghosh (1969) show that a sufficient condition for weak 
completeness of the location family of densities $\{f(x - \theta): \theta \in R\}$ is that the 
characteristic function $\hat{f}(t) = \int \exp(-itx) f(x)\,dx$ of $f$ has at most a finite number of 
roots on the real line. 


Example 11. (Basu and Ghosh). 


Let $X$ have density $f(x - \theta)$ with $f(x) = x^2 \exp(-x^2/2)/\sqrt{2\pi}$. Since $\hat{f}(t) = 
(1 - t^2) \exp(-t^2/2)$, which has only two roots, it follows that $X$ is weakly complete 
and hence admits only the a.s. constant functions as ancillaries. 
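
The closed form of the characteristic function in Example 11 can be confirmed numerically. The sketch below is an illustration with assumed integration settings (a trapezoidal rule on a truncated interval) and hypothetical function names. Because the density is even, only the cosine part of $\exp(-itx)$ contributes.

```python
import math

def f(x):
    """Density x^2 exp(-x^2 / 2) / sqrt(2 pi) from Example 11."""
    return x * x * math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def char_fn_real(t, half_width=12.0, steps=24_000):
    """Trapezoidal approximation of the integral of cos(t x) f(x) dx.

    The sine part of exp(-i t x) integrates to zero because f is even.
    """
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        x = -half_width + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * math.cos(t * x) * f(x)
    return total * h

def closed_form(t):
    """(1 - t^2) exp(-t^2 / 2), whose only real roots are t = -1 and t = 1."""
    return (1.0 - t * t) * math.exp(-0.5 * t * t)

for t in (0.0, 0.5, 1.0, 2.0, 3.0):
    print(t, char_fn_real(t), closed_form(t))
```

The two columns agree to high accuracy, and the closed form vanishes only at $t = \pm 1$, so the Basu-Ghosh finite-roots condition holds for this density.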


Example 12. The general location family. 


Let $X_1, \ldots, X_n$ be i.i.d. ~ $f(x - \theta)$ where $f(x)$ is a density with respect to 
Lebesgue measure on $R$. The differences $V = (V_2, \ldots, V_n) = (X_1 - X_2, \ldots, X_1 - X_n)$ 
are ancillary and the question is for which $f$ may one claim also maximal 
ancillarity? Examples 9 and 10 show that the answer depends on $f$. The 
conditional density of $U = X_1$ given $V = v = (v_2, \ldots, v_n)$ is $h_\theta(u \mid v) = 
c f(u - \theta) f(u - \theta - v_2) \cdots f(u - \theta - v_n)$ with $c$ being the appropriate normalizing constant. 
Since this yields a univariate location family $\{h_\theta(u \mid v) = h_v(u - \theta): \theta \in R\}$ with 
$h_v(x) = c f(x) f(x - v_2) \cdots f(x - v_n)$ one could appeal to the above sufficient criterion 
of Basu and Ghosh to establish weak completeness for this family by showing 
that $\hat{h}_v(t)$ has only a finite number of roots. 

Unfortunately, the Basu-Ghosh criterion of a finite number of roots 
frequently is not satisfied and then does not provide an answer concerning 
maximality. Examples for which this is the case are the Cauchy and double 
exponential distributions with n = 2. 


Example 13. 


Cox and Hinkley (p. 33, 1974) give the following simplified version of an 
example due to Basu (1964) which points out the dilemma of multiple ancillaries. 
Consider $N$ quadrinomial trials with probabilities 


$\frac{1}{6}(1 - \theta)$, $\frac{1}{6}(1 + \theta)$, $\frac{1}{6}(2 - \theta)$, $\frac{1}{6}(2 + \theta)$. 


If the numbers of outcomes in the four categories are $X$, $U$, $Y$, $V$, respectively, 
then $X + U$ is ancillary, as is $X + V$. The question is whether either is maximal 
ancillary. The answer is somewhat surprising and still mostly a conjecture. 

(i) First consider the case of $X + U$. The conditional distribution of 
$(X, Y)$ given $X + U = m$, $Y + V = n$ ($n + m = N$) is that of two independent 
binomial random variables, distributed respectively as $b(p_1, m)$ and $b(p_2, n)$ with 
$p_1 = (1 - \theta)/2$ and $p_2 = (2 - \theta)/4$. Since the conditional expectation of $X/m - 
2Y/n + 1/2$ vanishes for all $\theta$ we do not have conditional bounded completeness 
whenever $m \ge 1$ and $n \ge 1$. If $m = 0$ or $n = 0$ completeness follows easily. 

To establish weak completeness (conditionally) one needs to show that 
for any indicator function $f(x, y)$ with constant conditional expectation for all $\theta$ it 
follows that $f$ is either identically one or zero with conditional probability one. 
For $0 \le \alpha \le 1$ consider therefore the following identity for all $\theta$: 


$\sum_{x=0}^{m} \sum_{y=0}^{n} f(x, y) \binom{m}{x} \binom{n}{y} \left(\frac{1 - \theta}{2}\right)^x \left(\frac{1 + \theta}{2}\right)^{m-x} \left(\frac{2 - \theta}{4}\right)^y \left(\frac{2 + \theta}{4}\right)^{n-y} = \alpha$. 


Show that $f \equiv 0$ and $f \equiv 1$, or equivalently that $\alpha = 0$ and $\alpha = 1$, are the only 
solutions. Reparametrizing $\lambda = (1 - \theta)/(1 + \theta)$ the identity becomes 


$\sum_{x=0}^{m} \sum_{y=0}^{n} f(x, y) \binom{m}{x} \binom{n}{y} \lambda^x (1 + 3\lambda)^y (3 + \lambda)^{n-y} = \alpha\, 4^n (1 + \lambda)^{m+n}$. 


Comparing the coefficients of $\lambda^i$ and $\lambda^{m+n-i}$ for $i = 0, 1, 2$ on both sides of the 
identity and exploiting the binary nature of $f$ it is easy yet tedious to show weak 
completeness for the following cases: 1) $n = 1$ and $m = 1$, $m \ge 3$ and 2) $n = 2$ 
and $m \ge 1$. For the case $(m, n) = (2, 1)$ we don’t have weak completeness as 
can easily be seen by using $f(0, 1) = f(2, 0) = 1$ and $f(x, y) = 0$ otherwise. 
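
The counterexample for $(m, n) = (2, 1)$ can be verified with exact rational arithmetic. The sketch below is illustrative only (the function names are hypothetical); it evaluates the conditional probability of the set $\{(0, 1), (2, 0)\}$ under the two independent binomial laws with $p_1 = (1 - \theta)/2$ and $p_2 = (2 - \theta)/4$, and also checks that $X/m - 2Y/n + 1/2$ has conditional mean zero.

```python
from fractions import Fraction
from math import comb

def binom_pmf(k, size, p):
    """Binomial probability of k successes in `size` trials."""
    return comb(size, k) * p**k * (1 - p)**(size - k)

def conditional_prob(theta, m, n, support):
    """P_theta((X, Y) in support | X + U = m, Y + V = n)."""
    p1 = (1 - theta) / 2   # success probability for X given X + U = m
    p2 = (2 - theta) / 4   # success probability for Y given Y + V = n
    return sum(binom_pmf(x, m, p1) * binom_pmf(y, n, p2) for x, y in support)

def mean_of_linear_statistic(theta):
    """Conditional mean of X/m - 2Y/n + 1/2, i.e. p1 - 2*p2 + 1/2."""
    p1 = (1 - theta) / 2
    p2 = (2 - theta) / 4
    return p1 - 2 * p2 + Fraction(1, 2)

support = [(0, 1), (2, 0)]
for theta in (Fraction(-9, 10), Fraction(0), Fraction(1, 3), Fraction(9, 10)):
    print(theta, conditional_prob(theta, 2, 1, support), mean_of_linear_statistic(theta))
```

The probability comes out as 1/4 for every parameter value: expanding the two terms gives $(2 + 3\theta - \theta^3)/16 + (2 - 3\theta + \theta^3)/16 = 1/4$. The indicator of this set is therefore a nontrivial conditionally ancillary event.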

Using the reparametrization $\lambda = (2 - \theta)/(2 + \theta)$ one can show weak 
completeness for all $(n, m)$ with 3) $m = 1$ and $n \ge 1$ and 4) $m = 2$ and $n > 1$ 
(no counter example here). The above approach does not appear promising for 
the situations $n \ge 3$ and $m \ge 3$. 

(ii) Similar results can be obtained when considering the other ancillary, 
$X + V$, except that the above counter example does not obtain, i.e. the 
conditional distribution of $(X, Y)$ given $X + V = m$ is weakly complete for 
$(m, n)$ in the following cases: 1) $n = 1$ and $m \ge 1$, 2) $n = 2$ and $m \ge 1$, 3) $m = 
1$ and $n \ge 1$, 4) $m = 2$ and $n \ge 1$. 

What does this mean with respect to maximal ancillarity of $X + U$ and 
$X + V$? For $N = m + n \le 5$ the latter is maximal ancillary whereas the former 
is maximal ancillary for $N = 1, 2, 4, 5$ but not for $N = 3$. Maximality in the 
cases $N > 5$ at this point can only be conjectured. 


THE PITMAN CLOSENESS OF STATISTICAL ESTIMATORS: 
LATENT YEARS AND THE RENAISSANCE 


Pranab Kumar Sen, University of North Carolina at Chapel Hill 


Abstract 


The Pitman closeness criterion is an intrinsic measure of the comparative 
behavior of two estimators (of a common parameter) based solely on their joint 
distribution. It generally entails less stringent regularity conditions than in other 
measures. Although there are some undesirable features of this measure, the past 
few years have witnessed some significant developments on Pitman-closeness in 
its tributaries, and a critical account of the same is provided here. Some 
emphasis is placed on nonparametric and robust estimators covering fixed-sample 
size as well as sequential sampling schemes. 


Introduction 


In those days prior to the formulation of statistical decision theory 
(Wald, 1949), the reciprocal of variance [or mean square error (MSE)] of an 
estimator ($T$) used to be generally accepted as a universal measure of its 
precision (or efficiency). The celebrated Cramér-Rao inequality (Rao, 1945) was 
not known that precisely although Fisher (1938) had a fair idea about such a 
lower bound to the variance of an estimator. The use of mean absolute deviation 
(MAD) criterion as an alternative to the MSE was not that popular (mainly 
because its exact evaluation often proved to be cumbersome), while other loss 
functions (convex or not) were yet to be formulated in a proper perspective. In 
this setup, Pitman (1937) proposed a novel measure of closeness (or nearness) of 
statistical estimators, quite different in character from the MSE, MAD and other 
criteria. Let $T_1$ and $T_2$ be two rival estimators of a parameter $\theta$ belonging to a 
parameter space $\Theta \subset R$. Then $T_1$ is said to be closer to $\theta$ than $T_2$, in the 
Pitman sense, if 


$P_\theta\{|T_1 - \theta| \le |T_2 - \theta|\} \ge 1/2$, $\forall \theta \in \Theta$, (1) 

with strict inequality holding for some $\theta$. Thus, the Pitman-closeness criterion 
(PCC) is an intrinsic measure of the comparative behavior of two estimators. 
Note that in terms of the MSE, $T_1$ is better than $T_2$, if 


$E_\theta(T_1 - \theta)^2 \le E_\theta(T_2 - \theta)^2$, $\forall \theta \in \Theta$, (2) 

with strict inequality holding for some $\theta$; for the MAD criterion, we need to 
replace $E_\theta(T - \theta)^2$ by $E_\theta|T - \theta|$. In general, for a suitable nonnegative loss 
function $L(a, \theta)$: $R \times R \to R^+$, $T_1$ dominates $T_2$ if 

$E_\theta[L(T_1, \theta)] \le E_\theta[L(T_2, \theta)]$, $\forall \theta \in \Theta$, (3) 


with strict inequality holding for some $\theta$. We represent (1), (2) and (3) 
respectively as 


It is clear from the above definitions that for (2) or (3), one needs to evaluate 
expectations (or moments), while (1) involves a distributional operation only. 
Thus, in general, (2) or (3) may entail more stringent regularity conditions 
(pertaining to the existence of such expectations) than needed for (1). In this sense, 
the PCC is solely a distributional measure while the others are mostly moment 
based ones, and hence, from this perspective, the PCC has a greater scope of 
applicability (and some other advantages too). On the other hand, other 
conventional measures, such as (2) or (3), may have some natural properties 
which may not be shared by the PCC. To illustrate this point, note that if there 
are three estimators, say, $T_1$, $T_2$ and $T_3$, of a common parameter $\theta$, such that 


$E_\theta(T_1 - \theta)^2 \le E_\theta(T_2 - \theta)^2$ 
and 


$E_\theta(T_2 - \theta)^2 \le E_\theta(T_3 - \theta)^2$, $\forall \theta \in \Theta$, (5) 


then, evidently, $E_\theta(T_1 - \theta)^2 \le E_\theta(T_3 - \theta)^2$, $\forall \theta \in \Theta$. Or, in other words, the 
MSE criterion has the transitivity property, and this is generally the case with 
(3). However, this transitivity property may not always hold for the PCC. That 
is, $T_1$ may be closer to $\theta$ than $T_2$, and $T_2$ may be closer to $\theta$ than $T_3$ (in the 
Pitman sense), but $T_1$ may not be closer to $\theta$ than $T_3$ in the same sense. 
Although a little artificial, it is not difficult to construct suitable examples 
testifying to the intransitivity of the PCC (Blyth, 1972). Secondly, the measure in 
(2) or (3) involves the marginal distributions of $T_1$ and $T_2$, while (1) involves the 
joint distribution of $(T_1, T_2)$. Hence, the task of verifying the dominance in (1) 
may require more elaborate analysis. This was perhaps the main reason why, in 
spite of a good start and notable contributions by Geary (1944) and Johnson 
(1950), the PCC was regarded with some skepticism for more than thirty years! 
In fact, the lack of transitivity of the PCC in (1) caused some difficulties in 
extending the pairwise dominance in (1) to that within a suitable class of 
estimators. Only recently, such results have been obtained by Ghosh and Sen (1989) 
and Nayak (1990) for suitable families of equivariant estimators. We shall 
comment on them in a later section. Thirdly, in (1), when both $T_1$ and $T_2$ have 
continuous distributions and $T_1 - T_2$ has a non-atomic distribution, the $\le$ sign 
may as well be replaced by the $<$ sign, without affecting the probability inequality. 
However, if $T_1 - T_2$ has an atomic distribution, the two probability statements 
involving the $\le$ and $<$ signs, respectively, may not agree, and somewhat different 
conclusions may crop up in the two cases. Although this anomaly can be 
eliminated by attaching suitable probability (viz., 1/2) for the tie 
($|T_1 - \theta| = |T_2 - \theta|$), the process can be somewhat arbitrary and less convincing 
in general. Fourthly, the definitions in (1) through (3) need some modifications 
in the case where $\theta$ (and $T$) are $p$-vectors, for some $p > 1$. The MSE criterion 
lends itself naturally to an appropriate quadratic error loss, where for some 
chosen positive definite (p.d.) matrix $Q$, the distance function is taken as 
$\|T - \theta\|_Q$ and given by 


$\|T - \theta\|_Q^2 = (T - \theta)'Q(T - \theta)$. (6) 


The use of the Fisher information matrix $\mathcal{I}_\theta$ as $Q$ leads to the so-called 
Mahalanobis distance. Recall that 


$E_\theta\|T - \theta\|_Q^2 = \mathrm{Trace}[Q\,E_\theta(T - \theta)(T - \theta)']$ (7) 


so that (2) entails only the computation of the mean product error (or dispersion) 
matrix of $T_1$ and $T_2$. On the other hand, if instead of $|T_1 - \theta|$ and $|T_2 - \theta|$, in 
(1), we use $\|T_1 - \theta\|_Q$ and $\|T_2 - \theta\|_Q$, the probability statement may be a more 
involved function of the actual distribution of $(T_1, T_2)$ and of the $Q$. Although 
in some special cases this can be handled without too many complications (see 
for example, Sen, 1989a), in general, we may require more stringent regularity 
conditions to verify (1) in the vector case. In the asymptotic case, however, an 
equivalence of BAN estimators and Pitman-closest ones may be established under 
very general regularity conditions (viz., Sen, 1986), so that (1) and (2) may have 
asymptotic equivalence. But, in the multiparameter case, best estimators, in the 
sense of having a minimum value of (7) may not be BAN. A natural reference is 
the so-called Stein paradox (viz., Stein, 1956) for the estimation of the mean 
vector of a multivariate normal distribution. For $p$, the dimension of the 
multivariate normal law, greater than 2, Stein (1956) showed that the sample 
mean vector [although being the maximum likelihood estimator (MLE)] is not 
admissible, and later on, James and Stein (1962) constructed some other 
estimators which dominate the MLE in the light of (2) [as amended in (7)]. Such 
Stein-rule or shrinkage estimators are typically non-linear and are non-normal, 
even asymptotically. Thus, they are not BAN. So, a natural question arose: Do 
the Stein-rule estimators dominate their classical counterparts in the light of the 
PCC? An affirmative answer to this question has recently been provided by Sen, 
Kubokawa and Saleh (1989), and we shall discuss this in a later section. Fifthly, 
we have tacitly assumed so far that we have a conventional fixed-sample size 
case. There are, however, some natural situations calling for suitable sequential 
schemes, so that one may also like to inquire how far the PCC remains adoptable 
in such a sequential scheme. Some studies in this direction have been made very 
recently by Sen (1989a), and we shall discuss some of these results in a later 
section. Another direction in which the PCC has proven to be a very useful 
avenue for comparing estimators is the employment of more general loss 
functions (instead of the Euclidean norm or the usual quadratic norm) in the 
definition in (1). In the context of estimation of the dispersion matrix of a 
multivariate normal distribution and parameters in some other distributions 
belonging to the exponential family, one may adopt the entropy (or some related) 
loss functions which when incorporated in (1) lead to a more general formulation. 
This has been termed the generalized Pitman nearness criterion (GPNC) (viz., 
Khattree, 1987, for the dispersion matrix estimation problem). We shall review 
some of the developments in this area in the last section. 
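
The lack of transitivity of the PCC mentioned earlier in this section (Blyth, 1972) can be made concrete with a three-point toy example. The joint law below is an assumption made purely for illustration and is not Blyth's construction: with probability 1/3 each, the absolute errors of three estimators are one of the cyclic shifts of (1, 2, 3). The names are hypothetical.

```python
from fractions import Fraction

# Hypothetical joint law: each row is equally likely and lists the absolute
# errors (|T1 - theta|, |T2 - theta|, |T3 - theta|) of three estimators.
ABS_ERRORS = [(1, 2, 3), (2, 3, 1), (3, 1, 2)]

def pitman_prob(i, j):
    """P(|T_i - theta| <= |T_j - theta|) under the law above (indices 0, 1, 2)."""
    hits = sum(1 for errs in ABS_ERRORS if errs[i] <= errs[j])
    return Fraction(hits, len(ABS_ERRORS))

print("T1 vs T2:", pitman_prob(0, 1))
print("T2 vs T3:", pitman_prob(1, 2))
print("T1 vs T3:", pitman_prob(0, 2))
```

Here the first estimator is Pitman-closer than the second and the second is Pitman-closer than the third, each with probability 2/3, yet the first is closer than the third only with probability 1/3. The law does not depend on the parameter, so the same cycle holds for every parameter value.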

As has been mentioned earlier, for nearly four decades, there was not 
much activity in this general arena, while the past ten years have witnessed a 
remarkable growth of the literature on the PCC. This renaissance is partly due 
to the work of C.R. Rao (1981) who clearly pointed out the shortcomings of the 
MSE or the quadratic error loss and explained the rationality of the PCC (which 
attaches less importance to large deviations). The work of Efron (1975) also 
deserves a special mention: the feasibility of an estimator dominating the classical 
MLE of the univariate normal mean in the light of the PCC clearly points out 
the adaptability of the PCC in a more general situation where other forms of 
admissibility criteria may not work out well. A somewhat comparable picture in 
both the works of Efron (1975) and Rao (1981) might have been based on the 
MAD criterion which attaches less importance to large deviations than the MSE 
criterion. However, in the general multiparameter case, the MAD criterion may 
lose its appeal to a greater extent. This is mainly due to the following factors: 
(i) lack of invariance under suitable groups of transformations usually employed 
in multiparameter estimation problems, (ii) complexity of the definitions and (iii) 
need for the estimation of nuisance parameters (such as the reciprocal of the 
density functions) in the definition of the norm itself which usually requires really 
large sample sizes! One might also argue in favor of some other criteria. Hwang 
(1985) has considered the stochastic dominance criterion based on the marginal 
distributions with an arbitrarily chosen cut-off point, and this in turn introduces 
some arbitrariness in the adaptation of his measure; the dominance may not hold 
uniformly in the choice of such a cut-off point. Brown, Cohen and Strawderman 
(1976) advocated the use of some non-convex loss functions. We have no definite 
prescription in favor of the PCC, MAD, such non-convex loss functions or the 
stochastic dominance criterion, although the PCC may have some natural appeal. 
In passing, we may remark that some controversies have been reported in Roberts 
and Hwang (1988), although it is very hard to endorse fully the views expressed 
in this report. We would like to bypass these by adding that let the cliff fall 
where it belongs to! In our opinion, in spite of some of the shortcomings of the 
PCC, as have been mentioned earlier, the developments in the past decade have, 
by far, been much more encouraging to advocate in favor of the use of PCC (or 
the GPNC) in a variety of statistical models which will be considered here in the 
subsequent sections. We also refer to a recent Panel Discussion on Pitman 
nearness of statistical estimators at the International Conference on Recent 
Developments in Statistical Data Analysis and Inference (in honor of C.R. Rao) 
at Neuchatel, Switzerland (August 24, 1989), where some of these issues have 
been discussed critically, and a report of these findings is given in Mason, 
Keating, Sen and Rao (1990). As with any other measure, there are pathological 
examples where the PCC may not appear to be that rational, but in real 
applications, we will rarely be confronted with such artificial cases. On the other 
hand, in the conventional linear models and in multivariate analyses, some 
theoretical studies (supplemented by numerical investigations) made by Mason, 
Keating, Sen and Blaylock (1990) justify the appropriateness of the PCC, even 
when a dominance may not hold for the entire parameter space. Finally, in the 
asymptotic case where the sample size is large enough to justify the usual 
regularity conditions needed to use simplified distribution theory for the 
estimators, for a wider class of nonparametric and robust estimators, we may 
justify the adaptation of the PCC on a very broad ground. We shall stress this 
point in the subsequent sections. All in all, we welcome the renaissance of the 
PCC and look forward to further developments in this fruitful area of statistical 
research. 


PCC in the Single Parameter Case 


In this section, we stick to the basic definition in (1) and examine the 
Pitman-closeness of a general class of statistical estimators. According to (1), 
rival estimators are compared two at a time, while (2) or (3) lends itself readily 
to suitable classes of estimators. This prompted Ghosh and Sen (1989) to 
consider Pitman closest estimators within reasonable classes of estimators. In 
this context, we may remark that under (2), the celebrated Rao-Blackwell 
theorem depicts the role of unbiased, sufficient statistics in the construction of 
such optimal estimators. Ghosh and Sen (1989) have shown that under 
appropriate regularity conditions, a median unbiased (MU) estimator is 
Pitman-closest within an appropriate class of estimators. Recall that an estimator $T$ of $\theta$ 
is MU if 

$P_\theta\{T \le \theta\} = P_\theta\{T \ge \theta\}$, $\forall \theta \in \Theta$, (8) 

and $T_0$ is Pitman-closest within a class of estimators ($\mathcal{C}$), if (1) holds for $T_1 = T_0$ 
and every $T_2 \in \mathcal{C}$. In many applications, $T_0$ is a function of a (complete) 
sufficient statistic and $T_2 = T_0 + Z$, where $Z$ is ancillary. Then, note that 


$[|T_0 - \theta| \le |T_2 - \theta|] \Leftrightarrow [(T_0 - \theta)^2 \le (T_0 - \theta + Z)^2] \Leftrightarrow [2Z(T_0 - \theta) + Z^2 \ge 0]$ (9) 

while by Basu’s (1955) theorem, $T_0$ and $Z$ are independently distributed. Since 
$Z^2$ is a nonnegative random variable, the MU character of $T_0$ ensures that the 
right hand side of (9) has probability $\ge 1/2$, $\forall \theta \in \Theta$. This explains the role of 
MU sufficient statistics in the characterization of the Pitman-closest estimators. 
However, the following theorem due to Ghosh and Sen (1989) presents a broader 
characterization. 


Theorem 1. 


Let $T$ be an MU-estimator of $\theta$ and let $\mathcal{C}$ be the class of all estimators of the 
form $U = T + Z$, where $T$ and $Z$ are independently distributed. Then 
$P_\theta\{|T - \theta| \le |U - \theta|\} \ge 1/2$, for all $\theta \in \Theta$ and $U \in \mathcal{C}$. 
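
Theorem 1 lends itself to a quick simulation check. The sketch below is a Monte Carlo illustration under assumed distributions, not part of the original argument: the estimator is taken as normal with unit variance centred at the parameter (symmetric, hence median unbiased), the perturbation is an independent normal variable with a chosen mean, and the function name and seeds are hypothetical.

```python
import random

def pitman_closeness_freq(theta, z_mean, n_draws=50_000, seed=0):
    """Monte Carlo estimate of P(|T - theta| <= |U - theta|) with U = T + Z."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        t = rng.gauss(theta, 1.0)     # T ~ N(theta, 1): symmetric about theta, so MU
        z = rng.gauss(z_mean, 1.0)    # Z is drawn independently of T
        u = t + z
        if abs(t - theta) <= abs(u - theta):
            hits += 1
    return hits / n_draws

for theta in (-1.0, 4.0):
    for z_mean in (0.0, 2.0):
        print(theta, z_mean, pitman_closeness_freq(theta, z_mean))
```

In every configuration the estimated probability exceeds 1/2, as the theorem requires. For a standard normal perturbation the exact value is $1/2 + \arcsin(1/\sqrt{5})/\pi \approx 0.648$, since the event is that $Z$ and $2(T - \theta) + Z$ share the same sign.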

Theorem 1 typically relates to the estimation of the location parameter ($\theta$) in 
the usual location-scale model where the class $\mathcal{C}$ relates to suitable equivariant 
estimators (relative to appropriate groups of transformations). Various examples 
of this type have been considered by Ghosh and Sen (1989). In the context of the 
estimation of the scale parameter, the PCC has been studied in a relatively more 
detailed manner. Keating (1985) considered a general scale family of 
distributions, and confined himself to the class ($\mathcal{C}^0$) of all estimators which are 
scalar multiples of the usual MLE; however, he did not enforce any equivariance 
considerations to clinch the desired Pitman-closest property. Keating and Gupta 
(1984) considered various estimators of the scale parameter of a normal 
distribution, and compared them in the light of the PCC. Again in the absence 
of any equivariance considerations, their result did not lead to the desired 
Pitman-closest characterization. The following theorem due to Ghosh and Sen 
(1989) provides the desired result. 


Theorem 2. 


Let $\mathcal{C}^*$ be the class of all estimators of the form $U = T(1 + Z)$, where $T$ 
is MU for $\theta$ and is nonnegative, while $T$ and $Z$ are independently distributed. 
Then, $P_\theta\{|T - \theta| \le |U - \theta|\} \ge 1/2$, $\forall\, \theta \in \Theta$, $U \in \mathcal{C}^*$. 
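
A similar simulation can illustrate Theorem 2. The example below is an assumed toy 
setting, not one of the cases treated by Ghosh and Sen (1989): a single exponential 
observation $X$ with scale $\theta$ has median $\theta \log 2$, so $T = X/\log 2$ is 
nonnegative and MU for $\theta$; the multiplier $Z$ is taken (hypothetically) as 
uniform on $(-0.5, 0.5)$, independent of $T$. 

```python
import math
import random


def pitman_closeness_scale(theta=3.0, reps=200_000, seed=1):
    """Estimate P(|T - theta| <= |U - theta|) for U = T * (1 + Z)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(reps):
        x = rng.expovariate(1.0 / theta)  # exponential with mean (scale) theta
        t = x / math.log(2.0)             # median of X is theta*log(2), so T is MU
        z = rng.uniform(-0.5, 0.5)        # assumed multiplier, independent of T
        u = t * (1.0 + z)
        if abs(t - theta) <= abs(u - theta):
            wins += 1
    return wins / reps


if __name__ == "__main__":
    print(pitman_closeness_scale())
```

Because the comparison is scale invariant, the estimated probability should be the 
same (up to simulation error) for every value of `theta`. 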

Both these theorems have been incorporated in the PC characterization 
of BLUE (best linear unbiased estimators) of location and scale parameters in the 
complete sample as well as censored cases (Sen, 1989b); equivariance plays a 
basic role in this context too. Further note that if $T$ has a distribution 
symmetric about $\theta$, then $T$ is MU for $\theta$. This sufficient condition for $T$ is easy 
to verify in many practical applications. Similarly, if the conditional distribution 
of $T$, given $Z$, is symmetric about $\theta$, then in Theorem 1, we may not need the 
independence of $T$ and $Z$. The uniform distribution on 
$[\theta - \tfrac{1}{2}\delta, \theta + \tfrac{1}{2}\delta]$, $\delta > 0$, 
provides a simple example of the latter (Ghosh and Sen, 1989). 

We shall now discuss some further results on PCC in the single 
parameter case pertaining to the asymptotic case and to sequential sampling 
plans. The current literature on theory of estimation is flooded with asymptotics. 
Asymptotic normality, asymptotic efficiency and other asymptotic considerations 
play a vital role in this context. An estimator ($T_n$) based on a sample of size $n$ is 
termed a BAN (best asymptotically normal) estimator of $\theta$ if the following two 
conditions hold: 

$$n^{1/2}(T_n - \theta) \text{ is asymptotically normal } (0, \sigma_\theta^2) \tag{10}$$

[which is the AN (asymptotically normal) criterion], and 

$$\sigma_\theta^2 = \mathcal{I}_\theta^{-1}, \text{ where } \mathcal{I}_\theta \text{ is the Fisher information of } \theta \tag{11}$$


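[which is the B (bestness) criterion]. Let us now consider the class $\mathcal{C}_0$ of 
estimators $\{U_n\}$ which admit an asymptotic representation of the form: 

$$U_n - \theta = n^{-1} \sum_{i=1}^{n} \psi_\theta(X_i) + o_p(n^{-1/2}), \quad \text{as } n \to \infty, \tag{12}$$

where the score function $\psi_\theta(\cdot)$ may depend on the method of estimation and the 
model; $E_\theta \psi_\theta(X_i) = 0$ and $E_\theta \psi_\theta^2(X_i) = \sigma_\theta^2 < \infty$. 
Recall that for a BAN estimator of $\theta$, we would have a representation of the form (12) 
where $\mathcal{I}_\theta\, \psi_\theta(x_i) = f'_\theta(x_i; \theta)/f(x_i; \theta)$, $f(\cdot)$ is 
the probability density function and $f'_\theta$ is its first order derivative w.r. to $\theta$. 
Note further that $E_\theta\{[f'_\theta(X_i; \theta)/f(X_i; \theta)]^2\} = \mathcal{I}_\theta$, so that 
for a BAN estimator, $E_\theta\{\psi_\theta(X_i) f'_\theta(X_i; \theta)/f(X_i; \theta)\} = 1$, 
$\forall\, \theta$. Thus, if we let 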


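$$\xi_n = n^{-1/2} \sum_{i=1}^{n} (\partial/\partial\theta) \log f(X_i; \theta), \tag{13}$$

then for a BAN estimator $T_n$, we have under the usual regularity conditions that, 
as $n \to \infty$, 

$$\left( n^{1/2}(T_n - \theta),\ \xi_n \right) \xrightarrow{\mathcal{D}} N_2\!\left( (0, 0),\ \begin{pmatrix} \mathcal{I}_\theta^{-1} & 1 \\ 1 & \mathcal{I}_\theta \end{pmatrix} \right). \tag{14a}$$

Consider now the class $\mathcal{C}^0$ of all estimators $\{U_n\}$, such that as $n \to \infty$, 

$$\left( n^{1/2}(U_n - \theta),\ \xi_n \right) \xrightarrow{\mathcal{D}} N_2\!\left( (0, 0),\ \begin{pmatrix} \sigma_\theta^2 & 1 \\ 1 & \mathcal{I}_\theta \end{pmatrix} \right), \tag{14b}$$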


where $\sigma_\theta^2 \ge \mathcal{I}_\theta^{-1}$ and the equality sign holds whenever $U_n$ is a 
BAN estimator of $\theta$. Note that the $\sqrt{n}$-consistency of $U_n$ entails the unit 
covariance term. As such, by an appeal to Theorem 1 of Sen (1986) we conclude that the 
BAN estimator satisfying (14a) is asymptotically (as $n \to \infty$) a Pitman-closest 
estimator of $\theta$ (within the class $\mathcal{C}^0$). 
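
As a concrete illustration of this comparison (an assumed example, not one worked out 
in the text), take normal data, for which the sample mean is BAN while the sample 
median is asymptotically normal with a larger variance. The median can be written as 
mean + (median - mean), where the second term is location invariant and, by Basu's 
theorem, independent of the mean, so Theorem 1 applies as well. The function name and 
settings below are hypothetical. 

```python
import random
import statistics


def mean_vs_median(theta=0.0, n=25, reps=20_000, seed=2):
    """Estimate P(|mean - theta| <= |median - theta|) for N(theta, 1) samples."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(reps):
        xs = [rng.gauss(theta, 1.0) for _ in range(n)]
        mean_err = abs(statistics.fmean(xs) - theta)
        median_err = abs(statistics.median(xs) - theta)
        if mean_err <= median_err:
            wins += 1
    return wins / reps


if __name__ == "__main__":
    print(mean_vs_median())
```

The estimate should exceed 1/2, in line with the Pitman-closest character of the BAN 
estimator within this class. 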

Note that this characterization is localized to the class of asymptotically 
normal estimators. In the context of estimation of location (or simple regression) 
parameter, incorporating robustness considerations (either on a local or global 
basis), various other estimators have been considered by a host of workers. 
Among those, the M-, L- and R-estimators deserve special mention. The M- 
estimators are especially advocated for plausible local departures from the 
assumed model, and they retain high efficiency for the assumed model and at the 
same time possess good local robustness properties. The R-estimators are based 
on appropriate rank statistics and possess good global robustness properties. 
L-estimators are based on linear functions of order statistics with a similar 
robustness consideration in mind. In general, these M-, L- and R-estimators 
satisfy the AN condition in (10) through appropriate representations of the type 
(12), where $\psi_\theta(x) = \psi(x - \theta)$; see for example, Sen (1981, Ch. 8). From 
considerations of bestness based on the minimum (asymptotic) MSE, the optimal M-, 
L- and R-estimators all satisfy the bestness condition in (11). Hence, we conclude 
that an M-, L- or R-estimator of $\theta$ having the BAN character in the usual sense 
is also asymptotically Pitman-closest. This places the PCC in a very comparable 
stand in the asymptotic case. Note that being a completely distributional 
measure, the PCC does not entail the computation or convergence of the actual 
MSE of the estimators, and hence (14a) requiring the usual conditions needed for 
the BAN property, also leads to the desired PCC property. 

We consider now some recent results on PCC in the sequential case (Sen, 
1989a). Note that for the estimation of the mean of a normal distribution with 
unknown variance $\sigma^2$, generally a sequential sampling plan is advocated to ensure 
some control on the performance characteristics (which cannot be done in a fixed 
sample procedure). In this setup, the stopping number $N$ is a positive integer 
valued random variable such that for every $n \ge 2$, the event $[N = n]$ depends 
only on $\{s_k^2, k \le n\}$, where $s_k^2$ is the sample variance for the sample size $k$, 
$k \ge 2$. It is known that $\{\bar{X}_k, k \ge n\}$ and $\{s_k^2, k \le n\}$ are mutually 
independent, and hence, given $N = n$ (i.e., the $s_k^2$, $k \le n$), $T_n = \bar{X}_n$ 
satisfies the conditions of Theorem 1, so that $\bar{X}_N$ has the Pitman-closest 
character. This simple observation can be incorporated in a formulation of the PC 
characterization of sequential estimators. Let $\{X_i, i \ge 1\}$ be a sequence of 
independent and identically distributed (i.i.d.) random variables (r.v.) with a 
distribution function (d.f.) $F_\theta(x)$, $x \in R$, $\theta \in \Theta \subset R$. For every 
$n \ge 1$, consider the transformation: 

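$$(X_1, \ldots, X_n) \to (T_n, V_n, W_n) \quad (V_n \text{ could be vacuous}). \tag{15}$$

Here $W_n$ is a $(n-k-1)$-vector and $V_n$ is a $k$-vector, where $k$ is a nonnegative 
integer. Let $\mathcal{B}_T^{(n)}$ and $\mathcal{B}_W^{(n)}$ be the sigma sub-fields generated by 
$T_n$ and $W_n$, respectively, for $n \ge 1$, and suppose that 

$$[N = n] \text{ is } \mathcal{B}_W^{(n)}\text{-measurable}, \tag{16}$$
$$T_n \text{ is MU for } \theta, \tag{17}$$
$$Z_n \text{ is } \mathcal{B}_W^{(n)}\text{-measurable, and } T_n \text{ and } W_n \text{ are independently distributed.} \tag{18}$$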
As in Theorem 1, let $\mathcal{C}^0$ be the class of all (sequential) estimators of the form 
$U_N = T_N + Z_N$. Then, under (16), (17) and (18), 

$$P_\theta\{ |T_N - \theta| \le |U_N - \theta| \} \ge 1/2, \quad \forall\, U_N \in \mathcal{C}^0 \text{ and } \theta \in \Theta. \tag{19}$$

A similar extension of Theorem 2 to the sequential case works out under (16) - 
(18). 
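
The sequential statement (19) can be illustrated by simulation for the normal mean. 
The sketch below is hypothetical in every detail beyond the structure required by 
(16) - (18): the stopping rule (stop at the first $n \ge 2$ with $n \ge s_n^2/d^2$, 
capped at `max_n`) depends only on the sample variances, and the perturbation 
$Z_N = 2 s_N/\sqrt{N}$ is a function of the sample variance alone. 

```python
import math
import random


def sequential_pitman(theta=1.0, sigma=2.0, d=0.5, max_n=1000,
                      reps=20_000, seed=3):
    """Estimate P(|Xbar_N - theta| <= |Xbar_N + Z_N - theta|) under a
    variance-based stopping rule (an assumed, illustrative rule)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(reps):
        n, mean, m2 = 0, 0.0, 0.0
        while True:
            x = rng.gauss(theta, sigma)
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)  # Welford update of the sum of squares
            if n >= 2:
                s2 = m2 / (n - 1)     # sample variance s_n^2
                # The event [N = n] depends only on s_2^2, ..., s_n^2.
                if n >= s2 / d ** 2 or n >= max_n:
                    break
        z = 2.0 * math.sqrt(s2 / n)   # Z_N: a function of s_N^2 and N only
        if abs(mean - theta) <= abs(mean + z - theta):
            wins += 1
    return wins / reps


if __name__ == "__main__":
    print(sequential_pitman())
```

Since the stopping event and $Z_N$ are both determined by the sample variances, the 
estimated probability should stay at or above 1/2 for any choice of `d`. 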

The characterization of PC of sequential estimators made above is an 
exact one, in the sense that it holds for an arbitrary stopping number ($N$) so long 
as $N$ satisfies (16). In the context of bounded-width confidence intervals for $\theta$ or 
minimum risk (point) estimation of $\theta$ (and in some other problems too), the 
stopping number $N$ is indexed by a positive real number $d$ (i.e., $N = N_d$), such 
that $N_d$ is well defined for every $d > 0$ (and $N_d$ is usually $\searrow$ in $d$). In this 
setup, one considers an asymptotic model where $d \downarrow 0$. Often, there exists a 
sequence $\{n_d^0\}$ of positive integers ($n_d^0$ is $\searrow$ in $d$), such that 
$n_d^0 \to \infty$, as $d \downarrow 0$, and further, 

$$(n_d^0)^{-1} N_d \xrightarrow{P} 1, \quad \text{as } d \downarrow 0.$$


In such a case, we may extend the PC characterization to the class of BAN 
(sequential) estimators, without necessarily requiring (16). Consider the BAN 
estimators treated in (10) through (14), but now adapted to the stopping number 
$\{N_d\}$. Suppose that the $U_n$ [in (12)] satisfy an Anscombe-type condition 
[Anscombe (1952)] that for every $\epsilon > 0$ and $\eta > 0$, there exist a $\delta > 0$ and an 
integer $n_0$, such that 

$$P\Big\{ \max_{m:\, |m - n| \le \delta n} n^{1/2} |U_m - U_n| > \epsilon \Big\} < \eta, \quad \forall\, n \ge n_0. \tag{20}$$


This Anscombe-condition holds for the $\xi_n$ in (13) under no extra regularity 
conditions. On the other hand, (20) is also a byproduct of (weak) invariance 
principles for the $U_n$, which have been studied extensively in the literature [viz., 
Sen (1981), Ch. 3-8]. Thus, we may replace $\{U_{N_d}, T_{N_d}\}$ by 
$\{U_{n_d^0}, T_{n_d^0}\}$ as $d \downarrow 0$, and then make use of (14) to characterize 
the desired PC property of the sequential BAN estimators. Note that, in general, 
M-estimators of location are not scale-equivariant (so as to qualify for the class 
$\mathcal{C}$ in Theorem 1), and L- and R-estimators of location may not belong to this 
class either. Thus, in the finite sample case, the PC characterization may not apply 
to these estimators. But, in the asymptotic case (sequential or fixed-sample size 
setup), the PC characterization holds in spite of the fact that these estimators may 
not belong to the class $\mathcal{C}$ or that (16) may not hold. 

To sum up the main findings on PCC in the uniparameter case, we 
observe that the MU property (along with ancillarity and sufficiency) provides us 
with the desired tool for finding the Pitman-closest estimators in the fixed-sample 
as well as sequential cases. In the asymptotic case, BAN estimators enjoy the 
PC-property, and this naturally raises the question: What is the relationship of 
the PCC and the (asymptotic) variance of an estimator? Following the lead of 
Rao et al. (1986) and Keating and Mason (1985), Peddada and Khattree (1986) 
studied this problem; however, their main results pertain to two estimators, say, 
$T_1$ and $T_2$, which are distributed independently of each other, and hence, the 
conclusions derived from these results may not apply to a usual situation where 
the two rival estimators of a common parameter $\theta$ are not independently 
distributed. Moreover, as they were assuming normality in most of the cases 
(treated by them), more general results for such models can be obtained from Sen 
(1986). 


PCC in the Multiparameter Case 


There has been a lot of research work on the PCC in the multiparameter 
case, including shrinkage and sequential estimators. Let us consider the case of a 
vector $\theta = (\theta_1, \ldots, \theta_p)'$ of parameters, where $\theta \in \Theta \subset R^p$, for 
some $p \ge 1$. Let $T = (T_1, \ldots, T_p)'$ be an estimator of $\theta$. First, we need to 
extend the definition of the distance $|T - \theta|$ in (1) to the multiparameter case. 
Although the Euclidean norm is a possibility, since the different components of $T$ 
may have different importance (and they are generally not independent), a more 
general quadratic norm is usually adopted. We may define 

$$\|d\|_Q^2 = d' Q d, \quad d \in R^p, \tag{21}$$


where Q is a given p.d. matrix. It is not uncommon to use some other metric 
(viz., entropy, etc.), so that we may as well take a general 


$L(T, \theta)$, satisfying the usual properties of a 'norm'. (22) 


In the last section, in the context of estimation of dispersion matrices of 
multivariate normal distributions, we shall use such norms. As an extension of 
(1) and following the lead of Peddada (1985), we consider the following 
generalized Pitman nearness criterion (GPNC): An estimator $T_1$ is GPN closer 
than $T_2$ if 

$$P_\theta\{ L(T_1, \theta) \le L(T_2, \theta) \} \ge 1/2, \quad \forall\, \theta \in \Theta. \tag{23}$$


In the context of multivariate location models and in other situations 
too, it is quite possible to identify a class of estimators similar to that in 
Theorem 1. However, this would rest on plausible extensions of the notion of 
median unbiasedness in the multiparameter case. Since the components of $T$ 
may not be all independent and $Q$ in (21) may not be a diagonal matrix, the MU 
property for each coordinate of $T$ may not suffice. For our purpose, under (21), 
it seems that the following definition of multivariate MU property may suffice. 
We say that $T$ is MU for $\theta$, if 

$$\ell'(T - \theta) \text{ is MU for } 0, \text{ for every } \ell \in R^p,\ \theta \in \Theta. \tag{24}$$

In passing, we may remark that if $T$ has a distribution diagonally symmetric 
about $\theta$, then (24) holds, although the converse is not necessarily true. Recall 
that $T$ has a diagonally symmetric d.f. around $\theta$ if $T - \theta$ and $\theta - T$ both have 
the same d.f. 


Theorem 3. 


Let $T$ be an MU-estimator of $\theta$ [in the sense of (24)], and let $\mathcal{C}$ be the class 
of all estimators of the form $U = T + Z$, where $T$ and $Z$ are independently 
distributed. Then for any arbitrary p.d. $Q$, 

$$P_\theta\{ \|T - \theta\|_Q \le \|U - \theta\|_Q \} \ge 1/2, \quad \forall\, \theta \in \Theta,\ U \in \mathcal{C}. \tag{25}$$


The proof is simple (Sen, 1989a) and is omitted. As a simple example 
illustrating (25), consider the case where $X_1, \ldots, X_n$ are i.i.d. r.v.'s having the 
multinormal distribution with mean vector $\theta$ and dispersion matrix $\Sigma$. Then 
$T_n = n^{-1} \sum_{i=1}^{n} X_i$ is MU in the sense of (24). Further, for known $\Sigma$, $T_n$ is 
sufficient for $\theta$, and the class $\mathcal{C}$ consists here of all estimators of the form 
$T_n + Z_n$, where $Z_n$ is ancillary; this rests on the group of affine transformations 
$X_i \to a + BX_i$, $B$ nonsingular and $a$ arbitrary. Thus, by Theorem 3, within the class 
of such equivariant estimators of $\theta$, the sample mean $T_n$ (MLE) is the Pitman-closest 
one. By using the classical Helmert transformation for the multivariate normal 
vectors, it can be shown that the conclusion remains true in the case of unknown 
(but nonsingular) $\Sigma$. Moreover, the interesting feature of this example [or (25)] is 
that the construction of $T$ or the class $\mathcal{C}$ does not depend on $Q$ in (21). In the 
multiparameter case, we shall study the GPNC for the Stein-rule or shrinkage 
estimators, and in that context, it will be seen that these estimators neither 
belong to the class $\mathcal{C}$ nor necessarily retain their dominance for all $Q$ (i.e., for a 
given $Q$, the construction of the PC $T_n$ may generally depend on $Q$, and this $T_n$ may 
not retain its optimality simultaneously for all $Q$, possibly different from the adapted 
one). For the time being, we refrain from generalizing Theorem 2 to 
the vector-case; we shall make comments on it in the last section. Perhaps, it 
will be to our advantage to discuss the sequential analogue of Theorem 3, i.e., a 
multi-parameter extension of (19). Let us consider the same model as in (14) - 
(18) with the exception that in (15), $T_n$ is a vector and in (18), $Z_n$ is a vector 
too. Then the following result is proved in Sen (1989a): 

Under (16), (18) and (24), for the class $\mathcal{C}^0$ of (sequential) estimators of 
the form $U_N = T_N + Z_N$, we have 

$$P_\theta\{ \|T_N - \theta\|_Q \le \|U_N - \theta\|_Q \} \ge 1/2, \quad \forall\, \theta \in \Theta,\ U_N \in \mathcal{C}^0, \tag{26}$$

for any arbitrary (p.d.) $Q$. 

Again as an illustration, we may consider the multinormal mean vector 
($\theta$) estimation problem when the covariance matrix ($\Sigma$) is arbitrary and 
unknown. Ghosh, Sinha and Mukhopadhyay (1976) and others have considered 
suitable stopping numbers ($N$) which are based solely on the sample covariance 
matrices $\{S_n; n > p\}$, so that (16) and (18) hold (for $T_n = \bar{X}_n$, $n \ge 1$). 
Further, (24) follows from the diagonal symmetry of the d.f. of $\bar{X}_n$ (around $\theta$), 
$\forall\, n \ge 1$. Hence, (26) holds. 

Let us next consider the asymptotic case parallel to that in the previous 
section. As in (10) - (11), a BAN estimator $T_n$ is characterized by its asymptotic 
(multi-) normality along with the fact that the dispersion matrix of this asymptotic 
distribution is equal to $\mathcal{I}_\theta^{-1}$, where $\mathcal{I}_\theta$ is the Fisher information 
matrix. The representation in (12) also extends readily to this multiparameter case, 
and (13) relates to a stochastic $p$-vector ($\xi_n$) which has the dispersion matrix 
$\mathcal{I}_\theta$. Consider then the class $\mathcal{C}^0$ of all estimators $\{U_n\}$ for which 


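$$\left( n^{1/2}(U_n - \theta),\ \xi_n \right) \xrightarrow{\mathcal{D}} N_{2p}\!\left( 0,\ \begin{pmatrix} \nu_\theta & I \\ I & \mathcal{I}_\theta \end{pmatrix} \right), \tag{27}$$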


where $\nu_\theta - \mathcal{I}_\theta^{-1}$ is positive semi-definite, and the $\sqrt{n}$-consistency of 
$U_n$ entails the identity matrix $I$ in (27); for a BAN estimator $T_n$, 
$\nu_\theta = \mathcal{I}_\theta^{-1}$. Finally, in (21), it seems quite appropriate to let 
$Q = \mathcal{I}_\theta$. Then, by Theorem 1 of Sen (1986) we conclude that within the class 
$\mathcal{C}^0$ of estimators which are asymptotically multi-normal and for which (27) holds 
[with $\mathcal{I}_\theta^{-1}$ being replaced by the asymptotic dispersion matrix of 
$n^{1/2}(U_n - \theta)$], the BAN estimators are Pitman-closest with respect to the norm in 
(21), where $Q = \mathcal{I}_\theta$. 


The interesting feature is that we are no longer restricting ourselves to 
the class $\mathcal{C}$ of estimators (which are generally equivariant), but the 
Pitman-closest property depends on the adaptation of $Q = \mathcal{I}_\theta$. For an arbitrary 
$Q$, this property may not hold. The asymptotic theory of Pitman-closeness of sequential 
estimators runs parallel to that in the concluding part of the last section, and hence, 
we do not repeat these details. 

In multiparameter estimation problems, the usual MLE may not be 
admissible (in the light of quadratic error loss functions). Stein (1956) considered 
the simple model that $X$ has a multi-normal distribution with mean vector $\theta$ and 
dispersion matrix, say, $I_p$, for some $p \ge 1$. He showed that though $X$ is the 
MLE of $\theta$ for all $p \ge 1$, it is inadmissible for $p \ge 3$. James and Stein (1962) 
constructed a shrinkage version which dominates $X$ in quadratic error loss. 
Sparked by this Stein-phenomenon, during the past twenty-five years, a vast 
amount of work has been done in improving the classical estimators in various 
multiparameter estimation problems by suitable shrinkage versions; these 
improvements being judged by the smallness of appropriate quadratic error loss 
function based risks. Coming back to the multivariate normal law, such 
shrinkage or Stein-rule estimators do not belong to the class C considered in 
Theorem 3! Thus, the characterization of PC made in Theorem 3 is not appli- 
cable to such shrinkage estimators. This raises the question: Does the usual 
Stein-rule estimator have the PC property too? The answer is affirmative in a 
variety of situations, and moreover, this PC dominance may hold even under less 
restrictive regularity conditions. 

Rao (1981) initiated renewed interest in the PCC by showing that some 
simple shrinkage estimators may not be the Pitman closest ones! He actually 
argued that the usual quadratic error loss function places undue emphasis on 
large deviations which may occur with small probability, and hence, minimizing 
the mean square error may insure against large errors in estimation occurring 
more frequently rather than providing greater concentration of an estimator in 
neighborhoods of the true value. Since, typically, a Stein-rule estimator is non- 
linear and may not have (even asymptotically) multi-normal law, Rao’s criticism 
is more appropriate in this context. Actually, Rao, Keating and Mason (1986) 
and Keating and Mason (1988) have shown by extensive numerical studies that 
for the p-variate normal distribution, for p > 2, the James-Stein estimator is 
closer (in the Pitman sense) than the MLE X. The quadratic error loss criterion 
may also cause some difficulties in the usual linear models when the incidence 
(design) matrix is nearly singular; in such a case, a ridge regression estimator is 
generally preferred. In this context too, one may enquire whether such ridge 
regression estimators have the Pitman closeness property. This issue has been 
taken up by Mason, Keating, Sen and Blaylock (1990), and both theoretical and 
numerical studies are made. So long as the incidence matrix is non-singular, a 
ridge estimator may not dominate the classical least square estimator in the 
PCC, although it fares well over a greater part of $\Theta$. The lack of dominance 
arises mainly due to the fact that as $\theta$ moves away from the pivot, the 
performance of a ridge estimator may deteriorate, so that the inequality in (23) 
may not hold for all $\theta$ belonging to $\Theta$, although it generally holds for all 
$\theta$: $\|\theta\| \le C$, where $C$ is related to the factor $k$ ($> 0$) arising in the 
construction of a ridge estimator. Their study also covers the comparison of two 
arbitrary linear estimators in the light of the PCC. 

The interesting fact is that the PCC may not even need that $p$ is $\ge 2$ 
(comparable to $p \ge 3$ for the quadratic error loss)! Even for $p = 1$, 
$X \sim N(\theta, 1)$, Efron (1975) showed that for 

$$\delta = X - \Delta(X), \quad \Delta(x) = \tfrac{1}{2} \min\{x, \Phi(-x)\}, \quad x \ge 0, \tag{28}$$

[$\Delta(-x) = -\Delta(x)$, $x \ge 0$ and $\Phi(\cdot)$ is the standard normal d.f.], (1) holds for 
$T_1 = \delta$ and $T_2 = X$. He made some conjectures for $p \ge 2$. For the multivariate 
normal mean estimation problem, a systematic account of the PC dominance of 
Stein-rule estimators is given by Sen, Kubokawa and Saleh (1989). Consider first 
the model that for some positive integer $p$, $X$ has a $p$-variate normal distribution 
with mean vector $\theta$ and dispersion matrix $\sigma^2 V$, where $V$ is known (and p.d.), 
while $\theta$ and $\sigma^2$ are unknown. Also assume that $s^2$ is an estimator of $\sigma^2$, such 
that (i) $m s^2/\sigma^2 = \chi_m^2$, a r.v. having the central chi square distribution with $m$ 
($\ge 1$) degrees of freedom (DF), and (ii) $s^2$ is distributed independently of $X$. [In 
actual application, $X$ may be the sample mean vector or a suitable linear 
estimator (of regression parameters, for example) and $s^2$ is the residual mean 
square (with $m = n - q$, for some $q \ge 1$).] Keeping in mind the loss function in 
(21), we may consider a Stein-rule estimator of the form 


$$\delta_\phi = \{ I - \phi(X, s^2)\, s^2\, \|X\|_{Q,V}^{-2}\, Q^{-1} V^{-1} \} X, \tag{29}$$

where $\phi(X, s^2)$ is a nonnegative r.v. bounded from above by a constant $c$ 
(depending on $p$) (with probability one), and 
$\|X\|_{Q,V}^2 = X' V^{-1} Q^{-1} V^{-1} X$. Note that estimators of this type with a 
different bound for $\phi(\cdot)$ (and for $p \ge 3$) were considered by Stein (1981), and 
hence, we regard them as Stein-rule estimators. 
Then, we have the following result due to Sen et al. (1989). 


Theorem 4. 
Assume that $p \ge 2$, and 

$$0 \le \phi(X, s^2) \le (p-1)(3p+1)/(2p), \quad \text{for every } (X, s^2) \text{ a.e.} \tag{30}$$

Then $\delta_\phi$, given by (29), is closer than $X$ in the Pitman sense [i.e., (23) holds for 
$T_1 = \delta_\phi$, $T_2 = X$ and $L(T, \theta) = \|T - \theta\|_Q$]. 


If $\sigma^2$ were known, then in (29) and (30), we would have taken $\phi(X, \sigma^2)$ 
instead of $\phi(X, s^2)$. In this sense, the classical James-Stein (1962) estimator is a 
special case of (29). We may take $\phi(X, s^2) = a$: $0 < a < (p-1)(3p+1)/2p$, 
and consider the following versions: 


$$\delta_a = X - a\, s^2\, \|X\|_{Q,V}^{-2}\, Q^{-1} V^{-1} X, \tag{31}$$
t = X- minfa] X13 y XVI MIG y} TVX, (32) 


so that $\delta_a$ is a James-Stein estimator and $\delta_a^+$ is the so-called positive-rule 
version. Then again (23) holds with $T_1 = \delta_a^+$, $T_2 = \delta_a$, 
$L(T, \theta) = \|T - \theta\|_Q$ and 
$0 < a < (p-1)(3p+1)/2p$. Thus, the positive rule version dominates the 
classical James-Stein version in the light of the PCC as well. It may be remarked 
that for the quadratic error loss dominance, Stein (1981) had $p \ge 3$ and 
$0 < a < 2(p-2)$, while here $p \ge 2$ and $0 < a < (p-1)(3p+1)/2p$. For 
$p \in [2, 5]$, $(p-1)(3p+1)/2p > 2(p-2)$. For $p \ge 6$, in (30), we may as well 
replace $(p-1)(3p+1)/2p$ by $2(p-2)$. The main motivation of the upper bound 
in (30) was to include the case $p = 2$ and to have a larger shrinkage factor for 
smaller values of $p$. 
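
The Pitman-closeness comparison between a James-Stein estimator and $X$ can be 
explored numerically. The sketch below is an illustration under simplifying 
assumptions that are not part of Theorem 4 as stated: $\sigma^2 = 1$ is taken as known, 
$V = Q = I$, and the shrinkage constant `a` is fixed. The helper names `pc_bound` and 
`js_closer_than_mle` are hypothetical. 

```python
import random


def pc_bound(p):
    """Upper bound (p - 1)(3p + 1)/(2p) on the shrinkage constant, as in (30)."""
    return (p - 1) * (3 * p + 1) / (2 * p)


def js_closer_than_mle(theta, a, reps=50_000, seed=4):
    """Estimate P(||JS - theta|| <= ||X - theta||) for X ~ N_p(theta, I).

    Assumes sigma^2 = 1 known and V = Q = I, so the James-Stein estimator
    reduces to (1 - a / ||X||^2) X.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(reps):
        x = [rng.gauss(t, 1.0) for t in theta]
        norm2 = sum(v * v for v in x)
        shrink = 1.0 - a / norm2
        err_js = sum((shrink * v - t) ** 2 for v, t in zip(x, theta))
        err_x = sum((v - t) ** 2 for v, t in zip(x, theta))
        if err_js <= err_x:
            wins += 1
    return wins / reps


if __name__ == "__main__":
    for p in (2, 3, 5, 6):
        print(p, pc_bound(p), 2 * (p - 2))
    print(js_closer_than_mle((1.0, 1.0, 1.0), a=1.0))
```

The first loop reproduces the comparison of the two bounds for small and large $p$; 
the last line estimates the closeness probability at one parameter point. For 
$\theta$ far from the origin that probability approaches 1/2 from above, so a 
simulation can only confirm the inequality clearly at moderate $\|\theta\|$. 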

The proof of Theorem 4 depends on some intricate properties of 
noncentral chi square densities which may have some interest of their own. 
Basically, to verify (23) for $T_1 = \delta_\phi$ and $T_2 = X$, it follows through some 
standard steps that a sufficient condition is 

$$P\{ \chi_{p,\Delta}^2 \ge \Delta + c\, \chi_m^2 \} \ge 1/2, \quad \forall\, \Delta \ge 0,\ m \ge 1,\ p \ge 2, \tag{33}$$

where $c = (p-1)(3p+1)/(4pm)$, $\chi_{p,\Delta}^2$ has the noncentral chi square d.f. with $p$ 
DF and noncentrality parameter $\Delta$ ($\ge 0$), and $\chi_m^2$ has the central chi square d.f. 
with $m$ DF, independently of $\chi_{p,\Delta}^2$. The trick was to show that the left hand side 
of (33) is nonincreasing in $\Delta$ ($\ge 0$) and that as $\Delta \to \infty$, it converges to 1/2. 
Sen et al. (1989) 
also considered the case of $X \sim N_p(\theta, \Sigma)$, $\Sigma$ arbitrary (p.d.), 
$S \sim \text{Wishart}(\Sigma, p, m)$ independently of $X$ with $m \ge p$, and considered the 
usual shrinkage estimator 

$$\delta_\phi^* = X - (m - p + 1)^{-1}\, \phi(X, S)\, d_m\, \|X\|_{Q,S}^{-2}\, Q^{-1} S^{-1} X, \tag{34}$$

where $d_m = \mathrm{ch}_{\min}(QS)$ and $\phi(X, S)$ has the same bound as in (30). Then, for 
every $p \ge 2$, (23) holds for $T_1 = \delta_\phi^*$ and $T_2 = X$. 

Let us now consider the asymptotic picture relating to the Stein-rule 
estimators under the PCC. Generally, we have a sequence $\{T_n\}$ of estimators, 
such that as $n \to \infty$, 

$$n^{1/2}(T_n - \theta) \xrightarrow{\mathcal{D}} N_p(0, \Sigma), \quad \Sigma \text{ p.d.}, \tag{35}$$

and, also, we have a sequence $\{S_n\}$ of stochastic matrices, such that 

$$S_n \to \Sigma, \text{ in probability, as } n \to \infty. \tag{36}$$

Thus, a suitable test statistic for testing the hypothesis of a null pivot is 

$$\mathcal{L}_n = n\, T_n' S_n^{-1} T_n, \tag{37}$$

so that an asymptotic version of (34) is 


b SO a Saa © S, Ta (38) 
This form is of sufficient generality to cover a large class of $\{T_n\}$, both of 
parametric and nonparametric forms. In particular, for R- and M-estimators, for 
$\mathcal{L}_n$ in (37), instead of $T_n$ suitable rank or M-statistics may also be used. Also, in 
(38), a null pivot has been used; the modifications for a general $\theta_0$ are 
straightforward. Now, if $\theta \ne 0$, then $n^{-1}\mathcal{L}_n \to \theta' \Sigma^{-1} \theta$, as 
$n \to \infty$, so that $\mathcal{L}_n^{-1} \to 0$, as $n \to \infty$. Thus, for any fixed $\theta \ne 0$, 


$$n^{1/2} \|T_n - \delta_n\|_Q \to 0, \text{ in probability, as } n \to \infty, \tag{39}$$


so that asymptotically the Stein-rule version becomes stochastically equivalent to 
the classical version. For this reason, the asymptotic dominance picture has been 
considered in the case where @ belongs to a Pitman-neighborhood of the assumed 
pivot (0). Thus, we may consider a sequence $\{K_n\}$ of local (Pitman-) 
alternatives 

$$K_n: \theta = \theta_{(n)} = n^{-1/2} \lambda, \quad \lambda \in R^p. \tag{40}$$

Further, by virtue of (36), we may replace $S_n$ by $\Sigma$, and appeal to Theorem 4 
(where $s^2$ is taken as 1 and $V = \Sigma$). As such, we obtain that for every $\phi(\cdot)$, 
satisfying (30), 

$$\lim_{n \to \infty} P\{ \|\delta_n - \theta\|_Q \le \|T_n - \theta\|_Q \mid K_n \} \ge 1/2. \tag{41}$$


Thus, the usual robust and nonparametric Stein-rule estimators enjoy the Pitman 
closeness property in the asymptotic case (and for Pitman-alternatives) under less 
restrictive regularity conditions (than in the conventional case of quadratic error 
losses). 

Let us now consider sequential Stein-rule estimators and discuss their 
dominance in the light of the PCC. Consider a simple model: $\{X_i, i \ge 1\}$ are 
i.i.d. r.v. with $N_p(\theta, \sigma^2 I_p)$ d.f.; $\theta$ and $\sigma^2$ are unknown. Let 
$s_n^2 = (np)^{-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)'(X_i - \bar{X}_n)$; 
$\bar{X}_n = n^{-1} \sum_{i=1}^{n} X_i$ and consider a stopping number $N$, such that for 
every $n \ge 2$, $[N = n]$ depends only on $\{s_k^2, k \le n\}$. Let then 

$$\delta_N^b = \{ 1 - b\, s_N^2 / (N \|\bar{X}_N\|^2) \}\, \bar{X}_N, \tag{42}$$


where 

$$0 < b < (p-1)(3p+1)/(2p), \quad p \ge 2. \tag{43}$$

We may even allow $b$ to be replaced by $\phi(\bar{X}_N, s_N^2)$, where $\phi(\cdot)$ satisfies (30). 
Again note that $[N = n]$ depends only on $\{s_k^2, k \le n\}$, so that by virtue of the 
independence of $\{\bar{X}_n\}$ and $\{s_k^2\}$, given $[N = n]$, $\bar{X}_n$ has a multinormal 
distribution $(\theta, n^{-1}\sigma^2 I_p)$, independently of the $s_k^2$, $k \ge 2$. However, the 
shrinkage factor $b\, s_N^2 / (N \|\bar{X}_N\|^2)$ in (42) depends on all the r.v.'s ($N$, 
$\bar{X}_N$ and $s_N^2$). Hence, the simple proof for (26) may not be adaptable in this 
more complex situation. Nevertheless, it has been shown by Sen (1989a) that by 
virtue of a certain log-concavity property of the noncentral chi square density and 
the non-sequential results in Sen, Kubokawa and Saleh (1989) the following result 
holds. 


Theorem 5. 


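For the class of Stein-rule estimators in (42), whenever the stopping 
number $N$ satisfies (16) [with $W_n = (s_2^2, \ldots, s_n^2)$, $n \ge 2$], for every 
$b \in (0, (p-1)(3p+1)/2p]$, 

$$P_\theta\{ \|\delta_N^b - \theta\| \le \|\bar{X}_N - \theta\| \} \ge 1/2, \quad \forall\, \theta \in \Theta. \tag{44}$$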


In passing, we may remark that a parallel dominance result under a 
quadratic error loss has been proved by Ghosh, Nickerson and Sen (1987). In the 
fixed-sample size case, the PC dominance of $\delta_\phi^*$ in (34) has been established for 
an arbitrary (p.d.) $\Sigma$. On the other hand, for arbitrary $\Sigma$, the sequential case 
either in terms of the PCC or a quadratic error loss has not yet been resolved. 

The asymptotic theory of sequential shrinkage estimation in the light of 
the PCC has been worked out systematically in Sen (1987a, b; 1989c, d). The 
basic idea is to incorporate (19) for the proposed stopping rules, verify (20) as 
amended in the multivariate case, and then complete the proof by appeal to (35) 
through (41). Although, in the cited references, suitable quadratic error 
losses were used, our (35) through (41) ensure that the results remain adaptable 
in the PCC as well. Further, in this asymptotic setup, the covariance matrix $\Sigma$ 
can be quite arbitrary (p.d.). In the case of a quadratic error loss, the actual 
asymptotic risk functions were replaced by asymptotic distributional risk 
functions, so that the desired dominance results could be obtained under less 
restrictive regularity conditions. In the case of PCC, this replacement makes no 
difference in the asymptotic picture, and therefore, there is no need to assume 
additional regularity conditions under which the asymptotic limits of the actual 
quadratic error loss based risks exist. In the case of shrinkage estimation, there is 
a technical problem in finding an asymptotically optimal stopping time, and this 
has been discussed in detail in Sen (1989d). 


GPNC and Estimation of a Dispersion Matrix 


To motivate, let us consider the problem of estimating the dispersion 
matrix $\Sigma$ (p.d. but arbitrary) of a multinormal distribution. An unbiased 
estimator of $\Sigma$ is $S = (n-1)^{-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)(X_i - \bar{X}_n)'$, where 
$X_1, \ldots, X_n$ are i.i.d. r. vectors and $\bar{X}_n = n^{-1} \sum_{i=1}^{n} X_i$. Note that 
$A = (n-1)S \sim \text{Wishart}(\Sigma, n-1, p)$. 
One possibility is to take $\theta = \mathrm{vec}(\Sigma)$ and the class $\mathcal{C}_1$ of equivariant 
estimators $T = \mathrm{vec}(cA)$, $c > 0$, under the quadratic error loss function 
$L(T, \theta)$ as in (21) - (23). But the natural appeal for such a quadratic error loss 
function is not so convincing in this setup, and other forms of loss functions have 
been considered by various workers (viz., Haff, 1980, Sinha and Ghosh, 1987, and 
others). A popular choice is the so-called entropy loss function: 

$$L(S, \Sigma) = \mathrm{tr}(S\Sigma^{-1}) - \log|S\Sigma^{-1}| - p; \tag{45}$$

a second one 

$$L(S, \Sigma) = \mathrm{tr}(S\Sigma^{-1} - I)^2 \tag{46}$$

also deserves mention. [For the estimation of the precision matrix $\Sigma^{-1}$, $S^{-1}$ is a 
natural choice, and in (45) or (46), we may replace $S$ and $\Sigma^{-1}$ by $S^{-1}$ and $\Sigma$, 
respectively.] Consider the class of estimators ($\mathcal{C}_1$) of the form 

$$\{ cA : c > 0 \text{ and } (n-1)S \sim W(\Sigma, n-1, p) \}. \tag{47}$$

Also, consider the GPNC in (23). Then the following result is due to Khattree 
(1987). 


Theorem 6. 


Let $0 < a_2 < a_1 < 1$ and $a_i A \in \mathcal{C}_1$, $i = 1, 2$. Also, let 
$c_{p,n} = \mathrm{med}\{\chi^2_{p(n-1)}\}$. Then $a_1 A \succ_{GPN} a_2 A$ under the loss function 
in (45) if and only if 

$$p \log(a_1/a_2) \ge (a_1 - a_2)\, c_{p,n}. \tag{48}$$

Also, let $c_n^* = \mathrm{med}\{r_n\}$ where $r_n = [\mathrm{tr}(WW')] \div [\mathrm{tr}\, W]$ and 
$W \sim \text{Wishart}(I, n-1, p)$. Then, under (46), $a_1 A \succ_{GPN} a_2 A$ iff 

$$c_n^* \le 2(a_1 + a_2)^{-1}. \tag{49}$$

Thus, if we let $a_1^0 = p/c_{p,n}$ and $a_2^0 = 1/c_n^*$, then within the class $\mathcal{C}_1$ of 
estimators of $\Sigma$, $a_1^0 A$ (or $a_2^0 A$) is a unique best (in the GPNC sense) estimator 
of $\Sigma$ under the entropy loss [or (46)], and this cannot be improved within this class 
$\mathcal{C}_1$. 
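
The entropy-loss part of this result can be checked numerically in the univariate 
case $p = 1$, where $A/\sigma^2$ is a chi square variable with $n - 1$ DF and the best 
multiplier is the reciprocal of its median. The sketch below is an assumed 
illustration (not from Khattree, 1987): the median is itself approximated by Monte 
Carlo, and the best multiple of $A$ is compared with the unbiased choice 
$1/(n-1)$. All function names are hypothetical. 

```python
import math
import random
import statistics


def chi2_draw(rng, df):
    """One chi-square(df) variate, drawn as Gamma(shape=df/2, scale=2)."""
    return rng.gammavariate(df / 2.0, 2.0)


def entropy_loss(s, sigma2):
    """Entropy loss (45) specialised to p = 1."""
    r = s / sigma2
    return r - math.log(r) - 1.0


def gpn_check(n=5, sigma2=2.0, reps=100_000, seed=5):
    """Return (a_best, estimate of P{L(a_best*A) <= L(A/(n-1))})."""
    rng = random.Random(seed)
    df = n - 1
    # Monte Carlo approximation to med(chi-square_{n-1}); an assumption of
    # this sketch, in place of an exact quantile routine.
    c = statistics.median(chi2_draw(rng, df) for _ in range(reps))
    a_best, a_unbiased = 1.0 / c, 1.0 / df
    wins = 0
    for _ in range(reps):
        a_mat = sigma2 * chi2_draw(rng, df)  # A = (n - 1) S for p = 1
        if entropy_loss(a_best * a_mat, sigma2) <= entropy_loss(a_unbiased * a_mat, sigma2):
            wins += 1
    return a_best, wins / reps


if __name__ == "__main__":
    print(gpn_check())
```

Because the chi square median is smaller than its mean, `a_best` exceeds $1/(n-1)$, 
and the estimated probability should lie above 1/2. 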

It may be noted that $\mathcal{C}_1$ is the class of equivariant estimators under the 
(full affine) group of transformations: 

$$X \to a + BX, \quad A \to BAB', \quad B \text{ nonsingular},\ a \text{ arbitrary}. \tag{50}$$

Sinha and Ghosh (1987) and Sinha (1988) also considered a larger class $\mathcal{C}_2$ of the 
form: 

$$\mathcal{C}_2 = \{ TQT' : A = TT' \sim W(\Sigma, n-1, p);\ Q = \mathrm{Diag}(q_1, \ldots, q_p),\ q_j > 0, \text{ for } j = 1, \ldots, p \}, \tag{51}$$


and established the inadmissibility of the class $\mathcal{C}_1$ relative to the class 
$\mathcal{C}_2$, under various loss functions. A natural question arises in this context: Are 
the estimators in the class $\mathcal{C}_2$ admissible in the GPN sense? To address this problem 
properly, we may note that the entropy loss in (45) was first introduced in the 
univariate case by James and Stein (1961); in this special case, 
$\mathcal{C}_1 = \mathcal{C}_2$ contains the class of scalar multiples of the sample variance, and 
hence, the PC of an estimator can as well be judged by using the usual quadratic error 
loss. This was accomplished by Ghosh and Sen (1989) (from the PCC point of view). This 
equivalence result does not, however, hold generally for the multivariate case, and 
hence, a different approach is needed. The class $\mathcal{C}_2$ is too big, and although for 
suitable subclasses of $\mathcal{C}_2$ (defined by imposing additional partial ordering), 
admissibility of estimators in the GPN sense can be established, such a result 
may not generally hold for the entire class $\mathcal{C}_2$. This is being explored in detail 
(viz., Sen, Nayak and Khattree, 1990). The following results are worth 
mentioning in this context: 


(i) Within the class $C_2$, no estimator of $\Sigma$ is GPN-optimal! 


(ii) Let $D_2 = \operatorname{Diag}(d_{21}, \ldots, d_{2p})$ with $d_{2j}^{-1} = \operatorname{med}(\chi^2_{n+p-2j})$, for 
$j = 1, \ldots, p$, and let $\hat{D}_2 = T D_2 T'$. Also, let 


$C_3 = \{TQT' \in C_2 : Q - D_2 \text{ is positive semi-definite (p.s.d.)}\}$; (52) 
$C_4 = \{TQT' \in C_2 : D_2 - Q \text{ is p.s.d.}\}$. (53) 


Then, within the subclass $C_3$, $\hat{D}_2$ is GPN-optimal. Within the 
subclass $C_4$, no estimator of $\Sigma$ is GPN-optimal. 


(iii) Let $D_1 = \operatorname{Diag}(d_{11}, \ldots, d_{1p})$ with $d_{1j} = (n + p - 2j)^{-1}$, for 
$j = 1, \ldots, p$, and let $\hat\Sigma_1 = T D_1 T'$. Then, $\hat\Sigma_1$ is the James-Stein 
estimator of $\Sigma$, and its properties have already been studied by Sinha 
(1988). The usual estimator of $\Sigma$ is $\hat\Sigma_0 = (n-1)^{-1} A$. Then, 
although there is no GPN-optimal estimator of $\Sigma$ within the class 
$C_2$, both $\hat\Sigma_1$ and $\hat{D}_2$ dominate the classical estimator $\hat\Sigma_0$ in the 
GPN-sense. 
Abstract 


We review the literature on unbiased estimation of some functions of the 
Bernoulli parameter p in the sequential case. Recently explored connections 
between the so-called efficient and inefficient sampling plans, through the 
well-known concept of sufficiency, are also presented. 


Introduction 


Under the set up of independent identical Bernoulli trials with parameter 
p, various aspects of unbiased estimation of a parametric function g(p) have been 
studied in the literature. Early works of Girshick, Mosteller and Savage (1946), 
Wolfowitz (1946, 1947), Lehmann and Stein (1950), De Groot (1959) and Wasan 
(1964) are devoted to some general results on sequential binomial estimation. 
Later works by Gupta (1967), Sinha and Sinha (1975), Sinha and Bhattacharya 
(1982) and Sinha and Bose (1985) deal with problems related to unbiased 
estimation of 1/p. Recently Bose and Sinha (1984) studied the connections 
between the so-called efficient and inefficient Bernoulli sampling plans through 
the well known concept of sufficiency of statistical experiments. 

Our object in this paper is to present a comprehensive review of most of 
the available results in this area. We omit proofs of all the results. However, 
detailed and exact references to various results are provided. 

The next section is devoted to setting up the notations, nomenclature, 
and definition of efficient sampling plans. In the third section, we provide results 
on efficient sampling plans. The problem of unbiased estimation of 1/p, which 
has received a considerable amount of attention in the literature, is discussed in 
the fourth section. In the fifth section, we discuss the connection between efficient 
and inefficient sampling plans via the concept of sufficiency. Some concluding 
remarks are made in the last section. 


Notations and Nomenclature 


Let $\{Z_i, i = 1, 2, \ldots\}$ be an i.i.d. sequence of Bernoulli variates with 
$P(Z_i = 1) = p$ and $P(Z_i = 0) = 1 - p = q$ (say). We assume $p \in \Omega \subset (0, 1)$. 
Any realization of this process can be exhibited as a lattice path in the 
(X, Y)-plane, where a particle moves from the origin one step to the right (along 
X-axis) if the incoming observation is 0 and one step above (along Y-axis) if it is 1. A 
stopping rule can be viewed as a sequence of functions $\phi_n$, where $\phi_n$ is a function 
of $(z_1, \ldots, z_n)$. Each $\phi_n$ takes the value 0 or 1; given $z_1, \ldots, z_n$, $\phi_n(z_1, \ldots, z_n) = 1$ 
indicates that we take one more observation and $\phi_n(z_1, \ldots, z_n) = 0$ indicates that 
we stop at this stage. A point $\alpha = (x, y)$ is a continuation point if there exists 
one sequence of realization $(z_1, z_2, \ldots, z_{x+y})$ leading to $\alpha$ such that $\phi_j(z_1, \ldots, z_j) = 1$ 
$\forall j \le x + y$. A point $\alpha = (x, y)$ is a boundary point if there exists one sequence 
of realization $(z_1, z_2, \ldots, z_{x+y})$ leading to $\alpha$ such that $\phi_j(z_1, \ldots, z_j) = 1$ $\forall j < x + y$ 
and $\phi_{x+y}(z_1, \ldots, z_{x+y}) = 0$. A point may be a boundary point or a continuation 
point depending on the path. A point is an accessible point if it is either a 
boundary point or a continuation point. Points which are not accessible are 
inaccessible points. For any boundary point $\alpha = (x, y)$, $P(\alpha)$ denotes the 
probability of stopping at $\alpha$ and is given by 


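$P(\alpha) = p^y q^x \sum \{1 - \phi_{x+y}(z_1, \ldots, z_{x+y})\}$, the sum being over all 
$(z_1, \ldots, z_{x+y})$ leading to $(x, y)$, 


$= K(\alpha)\, p^y q^x$ (say), (1) 


where $K(\alpha)$ is the number of accessible paths from the origin to the point $\alpha$. 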

A stopping rule yielding the boundary points together with their 
probabilities $P(\alpha)$ shall be called a sampling plan P. We say that P is closed iff 
$\sum_{\alpha \in B} P(\alpha) = 1$ identically in $p \in \Omega$, B denoting the set of all boundary points of 
P. This refers to eventual termination with probability one. Only closed 
sampling plans are of interest to the practical experimenter and we shall assume 
so unless otherwise mentioned. 
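
As an illustration that is not part of the original review, the path counts $K(\alpha)$ in (1) and the closure condition can be checked numerically. The sketch below assumes a boundary point plan, i.e., one in which the decision to stop depends only on the point reached; the helper names `path_counts` and `stopped_mass` and the truncation level `max_n` are hypothetical choices made here.

```python
from fractions import Fraction

def path_counts(is_boundary, max_n):
    """Return {(x, y): K} for boundary points with x + y <= max_n.

    Assumes a boundary point plan: is_boundary(x, y) alone decides stopping.
    A step right (x + 1) records a 0, a step up (y + 1) records a 1.
    """
    level = {(0, 0): 1}   # continuing paths arriving at each point of index n
    K = {}
    for _ in range(max_n + 1):
        nxt = {}
        for (x, y), count in level.items():
            if is_boundary(x, y):
                K[(x, y)] = count            # paths that stop at this point
            else:
                for point in ((x + 1, y), (x, y + 1)):
                    nxt[point] = nxt.get(point, 0) + count
        level = nxt
    return K

def stopped_mass(K, p):
    """Sum of P(alpha) = K(alpha) p^y q^x over the boundary points found."""
    q = 1 - p
    return sum(k * p**y * q**x for (x, y), k in K.items())

# Binomial plan of size 3 and Inverse Binomial plan "stop at 2 successes".
binomial = path_counts(lambda x, y: x + y == 3, max_n=3)
inverse_binomial = path_counts(lambda x, y: y == 2, max_n=60)
print(binomial)
print(float(stopped_mass(inverse_binomial, Fraction(1, 2))))
```

For the Binomial plan the boundary is the line X + Y = 3 and the total stopping probability is exactly one for every p. For the Inverse Binomial plan the truncated sum falls short of one only by the probability of not having stopped within `max_n` trials.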

Given a closed plan P, we say that a parametric function g(p) is 
unbiasedly estimable if there exists a function $f(\alpha)$ such that 


$g(p) = E_p(f(\alpha)) = \sum_{\alpha \in B} f(\alpha) P(\alpha)$, $\forall p \in \Omega$. (2) 


When (2) holds, $f(\alpha)$ is said to define an unbiased estimate of g(p) and it is a 
proper estimate of g(p) if $f(\alpha) \in$ range of $\{g(p) : p \in \Omega\}$ for every $\alpha \in B$. 
Otherwise, it is said to be improper. We straightaway insist on non-negative 
estimability of g(p) (i.e., we demand $f(\alpha) \ge 0$) whenever $g(p) \ge 0$, $\forall p \in \Omega$. 
The reasons for this shall be clear as we proceed. In the same vein, for unbiased 
estimation of 1/p, we insist that the estimate $f(\alpha)$ be proper viz., $f(\alpha) \ge 1$, 
$\forall \alpha \in B$. 


Remark 1 


Given an arbitrary sampling plan, examining its closure is not always an 
easy task. Consider plans having boundaries determined through two infinite 
sequences of points $(0, a_0), (1, a_1), (2, a_2), \ldots$ and $(b_0, 0), (b_1, 1), (b_2, 2), \ldots$. Here 
$1 \le a_0 \le a_1 \le a_2 \le \cdots$ and $1 \le b_0 \le b_1 \le b_2 \le \cdots$ are two infinite 
sequences of positive integers. Such plans have been termed doubly simple (see 
Wolfowitz, 1946). For such plans, closure holds whenever $\liminf_{n \to \infty} A(n)/\sqrt{n} < \infty$, 
where A(n) refers to the number of accessible points of index n. However, an 
arbitrary unbounded sampling plan need not be doubly simple and, hence, the 
condition $\liminf_{n \to \infty} A(n)/\sqrt{n} < \infty$ can be substantially improved for other types 
of unbounded plans. As a matter of fact, plans with A(n) = O(n) can also be 
closed. The point to be noted is that the actual value of A(n) is not always an 
important factor to decide on closure or otherwise of a plan. Once an accessible 
point is reached by a path, only the nature of the remaining part of the sampling 
plan ahead of this point is relevant for the path to hit a boundary point, and 
hence, to lead eventually to closure of the plan. The reader is referred to Sinha 
and Bhattacharya (1982) for examples of various types of unbounded closed plans 
and other details. The notion of a transformed plan due to Sinha and Sinha 
(1975), which is also relevant in this context, is explained in the fourth section. 


Efficient Sampling Plans 


DeGroot (1959), under certain regularity conditions, established the 
validity of the Rao-Cramer lower bound for the variance of an unbiased estimate 
of any estimable parametric function g(p) based on a sequential sampling design. 
The concept of efficient sampling plans for unbiased estimation of g(p), as 
introduced by him, refers to a closed sampling plan P together with an unbiased 
estimate f(·) such that the sampling variance of f(·) attains its relevant lower 
bound (which of course depends on g(p) and the particular plan P). He observed 
that the only efficient sampling plans are the family of Inverse Binomials when 
g(p) is linear in 1/p. Of course, trivially the family of Binomials is also efficient 
when g(p) is linear in p. All other plans may be termed as inefficient. An 
efficient plan may be seen to maximize the efficiency per unit observation for all 
$p \in (0, 1)$. 

The sampling plans often studied in the literature implicitly (or 
explicitly) envisage that the decision to stop at a point (or continue) depends 
only on the point reached (rather than the path traversed in reaching that point). 
This leaves out a variety of plans obtained by quite interesting and practically 
suggested stopping rules. A quick example of such a plan is one in which we stop 
as soon as we obtain two consecutive successes (let us call this plan Plan P1). In 
this case there would be some boundary points which are exclusively so, namely, 
the points on the line Y = X + 2. There would be other points which would be 
continuation or boundary points depending on the path or route followed in 
reaching them. To differentiate the classical sampling plans from such plans, we 
shall call the former boundary point plans and the plans of the type P1 as route 
plans. These two types together form the class of all conceivable plans. 
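
The distinction can be made concrete by brute-force enumeration. The following sketch is an illustration added here, not taken from the literature cited; the function name `classify_points` and the horizon of 8 trials are hypothetical. It lists, for Plan P1, the points at which some path stops and the points through which some path continues.

```python
from itertools import product

def classify_points(max_len):
    """Classify lattice points for the rule 'stop at two consecutive successes'."""
    boundary, continuation = set(), set()
    for n in range(1, max_len + 1):
        for seq in product((0, 1), repeat=n):
            # Discard sequences along which sampling stopped before trial n.
            if any(seq[i] == 1 and seq[i + 1] == 1 for i in range(n - 2)):
                continue
            y = sum(seq)          # successes: steps along the Y-axis
            x = n - y             # failures: steps along the X-axis
            stops_now = n >= 2 and seq[-1] == 1 and seq[-2] == 1
            (boundary if stops_now else continuation).add((x, y))
    return boundary, continuation

boundary, continuation = classify_points(8)
only_boundary = sorted(boundary - continuation)   # exclusively boundary points
both = sorted(boundary & continuation)            # status depends on the route
print(only_boundary)
print(both[:5])
```

Within this horizon the exclusively boundary points are exactly those on the line Y = X + 2, while a point such as (1, 2) is a boundary point along the route 0, 1, 1 and a continuation point along the route 1, 0, 1.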
By easy modification of arguments in De Groot (1959) it can be shown 
that the Rao-Cramer bound remains valid for route plans. Moreover, in case the 
parameter space $\Omega$ is an open subset of (0, 1), the regularity conditions may be 
replaced by their local versions. These indicate that the only parametric 
functions efficiently estimable are of the form $(a + bq)/(p - \beta q)$ (a, b being 
arbitrary real numbers and $\beta$ being an integer $\ge -1$). These include p and 1/p 
in particular. The corresponding efficient plans are given by $P(\beta, c) = 
\{\alpha = (x, y) : y = x\beta + c\}$ with x, c integer, $c > 0$ and $\beta \ge -1$. Such a plan is 
closed if $q < 1/(\beta + 1)$ when $\beta > 0$, and $\forall p \in (0, 1)$ otherwise. These results 
have been derived recently by Dutta (1980), who designates such plans as 
Generalized Inverse Binomial Plans. 
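
The closure condition can be examined numerically. The sketch below is an added illustration under stated assumptions: the function name `hitting_mass` and the truncation `max_steps` are hypothetical, and the computation only accumulates the probability of reaching the boundary $y = \beta x + c$ within a finite number of trials, so it approximates closure rather than proving it.

```python
def hitting_mass(beta, c, p, max_steps):
    """Probability of reaching the boundary y = beta*x + c within max_steps trials."""
    q = 1.0 - p
    level = {(0, 0): 1.0}    # probability of being at a point and still sampling
    stopped = 0.0
    for _ in range(max_steps):
        nxt = {}
        for (x, y), mass in level.items():
            # A failure (prob. q) moves right, a success (prob. p) moves up.
            for (nx, ny), prob in (((x + 1, y), q), ((x, y + 1), p)):
                if ny == beta * nx + c:
                    stopped += mass * prob      # path hits a boundary point
                else:
                    nxt[(nx, ny)] = nxt.get((nx, ny), 0.0) + mass * prob
        level = nxt
    return stopped

# beta = 1, c = 1: the condition q < 1/(beta + 1) reads q < 1/2.
closed_case = hitting_mass(1, 1, 0.7, 800)
open_case = hitting_mass(1, 1, 0.3, 800)
print(closed_case, open_case)
```

With q = 0.3 the accumulated stopping probability is numerically one, whereas with q = 0.7 it stays well below one, in line with the stated condition.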

As regards the inefficient plans, we demonstrate in the fifth section that 
a large number of them are indeed sufficient for the efficient plans. 


Sequential Unbiased Estimation of 1/p 


The special problem of sequential unbiased estimation of 1/p has been 
initiated in Gupta (1967) and since then treated extensively in the literature. 
The central problem has been to characterize all sequential sampling plans which 
provide unbiased estimation of 1/p. It may be noted that the analogous 
problems of unbiased estimation of 1/q, 1/pq, etc. can be handled in a similar 
way. 

Gupta (1967) stated a very simple sufficient condition for a sequential 
sampling plan P to provide an unbiased estimate of 1/p: 


(i) Sufficient condition: if the closed plan P with boundary $B = \{r_i = 
(x_i, y_i), i = 1, 2, \ldots\}$ be such that by changing its boundary points 
from $r_i$ to $r_i' = (x_i, y_i + 1)$, we get a closed plan P' with boundary $B' 
= \{r_i' = (x_i, y_i + 1), i = 1, 2, \ldots\}$, then 1/p is estimable for the plan P. 
An unbiased estimate is given by $f(r_i) = K'(r_i')/K(r_i)$, $r_i \in B$, where 
$K'(r')$ is the number of paths from the origin to $r' \in B'$. 
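
As a worked illustration added here, take P to be the Inverse Binomial plan with boundary Y = c, so that P' is the Inverse Binomial plan with boundary Y = c + 1, which is again closed. The sketch assumes the standard path counts for these two plans (the last step into a boundary point must be a success); the function names and the truncation `max_x` are hypothetical.

```python
from fractions import Fraction
from math import comb

def gupta_estimate(x, c):
    """f(r) = K'(r') / K(r) at r = (x, c) for the Inverse Binomial plan Y = c."""
    K = comb(x + c - 1, x)        # paths to (x, c): the last step is a success
    K_shifted = comb(x + c, x)    # paths to (x, c + 1) under the plan Y = c + 1
    return Fraction(K_shifted, K)

def expected_estimate(c, p, max_x):
    """Truncated sum of f(r) P(r) over boundary points (x, c), x <= max_x."""
    q = 1 - p
    return sum(gupta_estimate(x, c) * comb(x + c - 1, x) * p**c * q**x
               for x in range(max_x + 1))

print(gupta_estimate(4, 3))
print(float(expected_estimate(3, Fraction(1, 2), 200)))
```

In this special case the ratio reduces to (x + c)/c, the number of trials divided by the number of successes, and the truncated expectation at p = 1/2 agrees with 1/p = 2 up to a negligible tail.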


Sinha and Sinha (1975) studied the problem in greater detail and, 
among other things, put forward the notion of a transformed plan which can be 
described as follows. For a given plan P with the set B of boundary points $\alpha$, let 
$(x', y')$ be an arbitrary but fixed point in the XY-plane. Then the transformed 
plan $P^T(x', y')$, corresponding to $(x', y')$, with the set $B^T(x', y')$ of boundary 
points $\alpha^T(x', y')$ is defined by the following three conditions: 


I. Every $\alpha^T$ belonging to $B^T$ also belongs to B necessarily. 


II. The points $\{(x, y) : x \ge x', y \ge y'\}$ constitute the totality of all 
points (accessible, boundary and inaccessible) of $P^T$. 

III. Every boundary point $\alpha \in B$ is either a boundary point $\alpha^T \in B^T$ or 
an inaccessible point in $P^T$. 


Given the plan P with boundary points B, the rules for obtaining the boundary 
points $\alpha^T \in B^T$ are as follows: 


(a) if $(x', y') \in B$, i.e., if $\alpha = (x', y')$, then $\alpha^T = \alpha$ is the only boundary 
point of $B^T$; 


(b) if $(x', y') \notin B$, then $\inf\{\alpha : \alpha = (x, y'), x > x'\}$, for $(x, y') \in B$, is 
the only point on 'Y = y'' that belongs to $B^T$; 


(c) if $(x', y') \notin B$, then $\inf\{\alpha : \alpha = (x', y), y > y'\}$, for $(x', y) \in B$, is 
the only point on 'X = x'' that belongs to $B^T$; 


(d) if $(x', y') \notin B$, any boundary point $\alpha \in B$ also belongs to $B^T$ if and 
only if it can be reached by a path from $(x', y')$. Otherwise, it is an 
inaccessible point of $P^T$. 


It may be noted that whenever the point $t = (x', y')$ is an accessible point of P, 
we have 


$p^{y'} q^{x'} = \sum_{\alpha \in B^T(x', y')} t(\alpha)\, p^{y} q^{x}$, 

i.e., $1 = \sum_{\alpha \in B^T(x', y')} t(\alpha)\, p^{y - y'} q^{x - x'}$, (3) 


where $t(\alpha)$ = total number of ways of passing from t to $\alpha = (x, y)$ only through the 
accessible points of $P^T(x', y')$. Even when $t = (x', y')$ is an inaccessible point of 
P, we may use the above definition of $t(\alpha)$ for all $\alpha \in B^T(x', y')$. 

The transformed plan $P^T(x', y')$ is defined to be closed only when the 
identity (3) above holds, no matter whether $(x', y')$ is accessible or not. With 
reference to the problem of unbiased estimation of 1/p, Sinha and Sinha (1975) 
came up with the following separate necessary and sufficient conditions. 


(ii) Necessary condition: the sampling plan must be unbounded along the 
X(failure)-direction. 


(iii) Sufficient conditions: (a) if no point on the line Y = 1 is inaccessible, 
then 1/p is estimable. (b) let $(x_0, 1)$ be a first inaccessible point on 
the line Y = 1. If the transformed plan $P^T(x_0, 1)$ is closed, then 1/p is 
estimable. 
It has been demonstrated in Sinha and Sinha (1975) that the sufficient 
conditions (i) and (iii)(b) are equivalent, and conjectured that the sufficient 
condition (i) is necessary as well. In Sinha and Bhattacharya (1982), useful 
notions of finite-step and infinite-step generalizations of the Inverse Binomials 
have been introduced, and the following results have been deduced. See also 
Sinha and Bose (1985) in this context. 


(iv) All finite-step generalizations of the Inverse Binomials provide 
unbiased estimation of 1/p. 


(v) Every infinite-step generalized Inverse Binomial, whenever closed, 
provides unbiased estimation of 1/p. 


Incidentally, an infinite-step generalized Inverse Binomial plan is closed if 
and only if $\liminf_{n \to \infty} d(n)/n = 0$, where $(n - d(n), d(n))$ is the coordinate position 
of the boundary point on the line X + Y = n (n = 1, 2, ...). For a proof, see 
Bhattacharya and Sinha (1982), Bose and Sinha (1984). 

The conjecture relating to a characterization of all sampling plans 
providing unbiased estimation of 1/p has been settled in the affirmative in Sinha 
and Bose (1985). The result is stated below. 


Theorem 1 


A plan P provides unbiased estimation of 1/p if and only if the plan P' 
defined in the sufficient condition (i) is closed. 


Connections between Efficient and Inefficient Plans 


In this section, we demonstrate that a large number of inefficient 
sampling plans are indeed sufficient for the efficient plans. These results have 
been established in Bose and Sinha (1984). 

The concept of sufficiency in comparing statistical experiments is well 
known. Roughly speaking, an experiment $\mathcal{E}$ resulting in a random variable X 
having law of distribution $F_\theta(\cdot)$ is said to be sufficient for another experiment $\mathcal{E}'$ 
resulting in a r.v. Y having law of distribution $G_\theta(\cdot)$ if, given an observation x 
on X, it is possible to generate an observation y on Y using a known 
randomization procedure, i.e., a known law of distribution $Z(\cdot \mid x)$, which is 
independent of $\theta$. If the above holds, we say that X is sufficient for Y and write 
$X \succ Y$. 

Clearly, when $X \succ Y$, it is enough to observe X to generate Y, if 
needed. Moreover, it is known (Blackwell and Girshick, 1954) that when 
$X \succ Y$, for any estimable parametric function $g(\theta)$, given any unbiased estimate 
based on Y, one can construct an unbiased estimate based on X which is as good 
(in the sense of having equal or smaller variance). Applied to the present set up, 
this would mean that any plan, whenever sufficient for a given inefficient plan, 
would provide smaller variance (but certainly larger ASN) than the latter. 

The following general results on comparison of sampling plans for 
sufficiency consideration are interesting and illuminating. We consider two 
arbitrary closed plans P* and P, and state conditions under which $P^* \succ P$. In 
what follows B* (B) denotes the set of boundary points of P* (P). We also 
assume that each of $\Omega(P)$ and $\Omega(P^*)$, the parameter space for closure of P and 
P*, is the entire interval (0, 1). 

Before we state the results we mention the notion of completeness in this 
context. Writing $P^*(\alpha^*) = K^*(\alpha^*)\, p^{y^*} q^{x^*}$ for $\alpha^* = (x^*, y^*) \in B^*$, a plan P* is 
said to be complete if $\sum_{\alpha^* \in B^*} f(\alpha^*) P^*(\alpha^*) = 0$, $\forall p \in \Omega$, implies $f(\alpha^*) = 0$, 
$\forall \alpha^* \in B^*$. The following result (necessity due to Girshick, Mosteller and Savage (1946), 
sufficiency due to Lehmann and Stein (1950)) gives a characterization of such 
plans which is useful in the sequel. 


Theorem 2 
A plan P* is complete if and only if the following hold: 


(a) The plan is simple (i.e., the continuation points of P* on the line 
X + Y = n form an interval, $\forall n \ge 1$). 


(b) The removal of any boundary point destroys closure of the plan. 


Following Bose and Sinha (1984), a series of results can be stated. 


Theorem 3 


(i) A necessary condition for $P^* \succ P$ is that for every $\alpha = (x, y) \in B$, 
$p^y q^x$ is estimable under P*. 


(ii) If P* is complete, then (i) ensures that $P^* \succ P$. 


Bose and Sinha (1984) observed that if P* is not complete, then the 
estimability of $p^y q^x$ under P* for every $\alpha \in B$ may not necessarily yield 
$P^* \succ P$. They also noted that the completeness of P* is not necessary for it to 
be sufficient for P. 

It is clear from the above result that the estimability of $p^y q^x$ for an 
arbitrary point $\alpha = (x, y) \in B$ of P with reference to P* arises naturally. 
Wolfowitz (1946) established its estimability in case $\alpha$ is an accessible point of 
P*, though $p^y q^x$ may be estimable even otherwise. The following theorem 
provides a necessary condition. 

In what follows, a point $\alpha$ is defined to lie below $\alpha^*$ if $\alpha$ lies in the 
rectangle formed by the two axes and the point $\alpha^*$. A point $\alpha^*$ lies above $\alpha$ if it 
lies in the positive quadrant formed by $\alpha$ as the origin. 
Theorem 4 


A necessary condition for estimability of $p^y q^x$ under a plan P* is the 
existence of at least one boundary point of P* above (x, y). 

As a consequence, we have the following corollary on necessary conditions 
for P* to be sufficient for P. 


Corollary 1 


Two necessary conditions for P* to be sufficient for P are: 
(i) For every $\alpha \in B$, $\exists\, \alpha^* \in B^*$ above $\alpha$. 
(ii) For every $\alpha^* \in B^*$, $\exists\, \alpha \in B$ below $\alpha^*$. 


However, as noted in Bose and Sinha (1984), (i) and (ii) together with 
even estimability of $p^y q^x$, $\forall \alpha \in B$, are not enough to assert $P^* \succ P$. 

We now state a sufficient condition for the estimability of $p^y q^x$ under a 
plan P* based on the notion of transformed plans as explained in the last section. 
Treating (x, y) as the origin, we can derive a transformed form of P* to be 
denoted as $P^{*T}(x, y)$. In this new plan, the paths emerge from the new origin, 
and get merged into accessible points or escape them. Note that if (x, y) is itself 
a boundary point of P*, the transformed plan does not get started at all. Clearly 
the set of boundary points of P* above (x, y) is regarded as the set of boundary 
points of $P^{*T}(x, y)$. 


Theorem 5 


Whenever $P^{*T}(x, y)$ is closed, $p^y q^x$ is estimable. 

We conclude this section with another simple sufficient condition for 
$P^* \succ P$. Let $K^{**}(\alpha)$ be the number of accessible paths of P* from origin to $\alpha$ 
without hitting any other $\alpha' \in B$, leading to $\alpha$ as a continuation point of P*. 


Theorem 6 


$K(\alpha) = K^{**}(\alpha)$, $\forall \alpha \in B$ implies $P^* \succ P$. 

Specialized to the problem of obtaining plans sufficient for the Binomials, 
we have the following results. 


Theorem 7 


(a) A closed plan P* is sufficient for the Inverse Binomial plan P(0, c) if 
and only if there exists no boundary point of P* below the line Y = c. 


(b) A closed plan P* is sufficient for the fixed Binomial plan of size n if 
and only if there is no boundary point of P* below the line X + Y = n. 

As a consequence of (a), we have the following result for the First Waiting Time 
plan. 


(c) A plan P* with no boundary points on the X-axis is sufficient for the 
plan P(0, 1). Only such plans are sufficient for P(0, 1). 


Concluding Remarks 


(i) In a recent paper, Bhandari and Bose (1989) have derived conditions 
on the nature of unbiasedly estimable functions g(p). They have 
demonstrated that g has to be continuous if it is unbiasedly estimable. 
Further, if g is nondifferentiable, then it is not unbiasedly estimable by 
a bounded estimator with finite expected stopping time for all p. This 
shows that $g(p) = \min(p, 1 - p)$ is not estimable by any finite (or 
bounded) sampling plan though there are plenty of unbounded 
sampling plans useful for this purpose. An open problem in this 
context is the following: 

Does there exist any proper unbiased estimate 
of $\min(p, 1 - p)$? 


(ii) The following problem is also of considerable interest. Fix an integer 
n and consider the class of all Bernoulli sampling plans P such that for 
boundary points of the type $\alpha = (x, y)$, $\int E_p(x + y)\, d\gamma(p) \le n$ for a 
prior distribution $\gamma(\cdot)$ on p. Does there exist a sampling plan in this 
class which is the best for estimation of p? Here bestness refers to 
minimum prior expectation of posterior variance. In particular, one 
would be curious to know if the Binomial plan is the best for all or 
some priors $\gamma(\cdot)$. 

By a slight modification of the above problem, we may as well 
search for the best plan among those for which $E_p(x + y) \le n$, $\forall p$. 
Bhandari et al. (1989) have obtained some partial results in this 
direction. 


(iii) Rustagi (1975) has studied some aspects of estimation of p in the 
simple Markovian set up. Following this, Sinha and Bhattacharya 
(1982) initiated a study in the dependent set up in the context of 
sequential estimation. Further research is needed in this area. 
Introduction 


We undertake with this title a brief survey of various definitions of 
sufficiency, with some of their properties and the relationships between them. 

Works on this theme are found in a sequence, if not so much as a stream, 
of developments from the sixties through the eighties. We consider such works as 
attempts at mathematical conceptualization of the statistical notion of 
sufficiency, and try to examine how far they have been successful in capturing the 
intuitive and logical content of the notion. Emphasis has naturally been put on 
the more recent developments, but some earlier results had to be touched upon 
insofar as they form part of the historical or logical background. 

A reason for this choice of a theme is that sufficiency today is not as 
prolific a subject as in its early days, making it difficult to draw a recent trend out 
of the publications of the last few years. Only a few titles with the word sufficiency 
appear each year in Current Index to Statistics, mostly with their main interest 
in neighboring though closely related subjects, e.g., ancillarity, information and 
comparison of experiments. They will be better treated separately under the 
respective titles, rather than thrown together into such a short survey as this one. 

Out of the remaining papers in sufficiency proper, being still fewer in 
number, we could pick out some fairly recent results to form an additional 
section on Basu Theorems. 

Neither a monograph nor a bibliography on this subject has recently come 
to our attention. So the early bibliography by Basu & Speed (1975) as well as 
the survey Partial sufficiency (Basu, 1978) is still partially sufficient (at least) to 
a reader. 


Statistical Notion and Mathematical Definitions 


Sufficiency as a statistical notion means the property of a statistic 
retaining all the relevant information contained in the whole sample. As is well 
known, it first appeared in Fisher (1920) (see Stigler, 1973, for historical 
background), which pointed out that an estimate of a parameter can be regarded to 
sum up the whole of the information respecting the parameter which a sample 
provides if, for any given value of it, the conditional distribution of any other 
estimate is independent of the parameter. This idea of expressing the notion by 
means of conditional probability developed into Fisher's (1922) first definition of 
sufficiency. A statistic T is called sufficient if: 
(A) The conditional distribution of the sample when given T does not 
depend on the parameter. 
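
Definition (A) can be illustrated on the simplest example. The sketch below is added here for illustration and is not drawn from the works cited: it takes n independent Bernoulli trials with T the total number of successes, and computes the conditional law of the sample given T by direct enumeration. The function name `conditional_given_total` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def conditional_given_total(n, t, p):
    """Conditional law of (Z_1, ..., Z_n) given T = Z_1 + ... + Z_n = t."""
    q = 1 - p
    # Joint probability of each 0/1 sample with exactly t successes.
    joint = {z: p**t * q**(n - t)
             for z in product((0, 1), repeat=n) if sum(z) == t}
    total = sum(joint.values())          # P(T = t)
    return {z: w / total for z, w in joint.items()}

law_a = conditional_given_total(4, 2, Fraction(1, 3))
law_b = conditional_given_total(4, 2, Fraction(4, 5))
print(law_a == law_b, set(law_a.values()))
```

The two conditional laws coincide although the values of p differ: given T = 2 out of 4 trials, each of the six possible samples has conditional probability 1/6.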


Subsequently various aspects of the same notion concerning specific 
classes of inference and decision problems found different expressions in the 
following definitions. A statistic T is called sufficient if: 


(B) The distribution of the sample can be reconstructed from that of T 
through randomization, or, mathematically, a stochastic kernel 
(Blackwell sufficiency, Blackwell, 1951). 


(C) For every decision problem, given a decision function based on the 
sample, there exists a decision function based on T which is at least as 
good as the former (Decision sufficiency attributed to Bohnenblust, 
Shapley and Sherman. See Blackwell, 1951). 


(D) For any prior distribution of the parameter, the posterior is a function 
of the sample through T (Bayesian sufficiency, Kolmogorov, 1942). 


Meanwhile the Definition (A) underwent measure-theoretic 
sophistications through Halmos & Savage (1949) and Bahadur (1954) giving rise 
to the following: 


Definition 1 


Let $E = (\mathcal{X}, \mathcal{A}, \mathcal{P})$ be a statistical experiment and $\mathcal{B}$ be a subfield 
(more precisely, a sub-$\sigma$-field) of $\mathcal{A}$. $\mathcal{B}$ is called sufficient if for every A in 
$\mathcal{A}$ there exists a $\mathcal{B}$-measurable function $P(A/\mathcal{B})(x)$ which satisfies, for all B 
in $\mathcal{B}$ and p in $\mathcal{P}$, 

$p(A \cap B) = \int_B P(A/\mathcal{B})(x)\, dp$. 


A statistic is called sufficient if the subfield induced by it is sufficient. 


Notice that this is more general than (A), as it applies to subfields in 
general, including in its scope those subfields which are not induced by a statistic. 
Also, it allows the cases where $P(A/\mathcal{B})(x)$ is not a measure on $\mathcal{A}$. Though 
$P(A/\mathcal{B})(x)$ is called conditional probability, it is not guaranteed to be a measure 
by the Radon-Nikodym Theorem, on which this definition is based. 

This is the standard definition of sufficiency, most commonly used at 
present. We also will adopt it here, but will refer to it as Sufficiency, with the 
initial capital S, so as to avoid confusion. Subfield versions are available also for 
all other definitions. They are to be understood whenever references are made to 
the definitions. 

The very general and measure-theoretical way in which Sufficiency is 
defined made it possible to prove many useful results with full rigour and under 
the widest possible conditions. In particular, it implies the conditions (B) and 
(D) without any restrictions, while (A) and (C) easily follow in the cases where 
regular conditional probabilities exist. 


Dominated Case 


The success of Sufficiency was especially remarkable in the dominated 
case. E is called dominated if there exists a $\sigma$-finite measure m on $\mathcal{A}$ wrt. which 
each p in $\mathcal{P}$ has a density dp/dm. In this case, it follows that: 


1) $\mathcal{X}$ is covered by a countable family of mutually disjoint subsets, called 
kernels, of the supports $S(p) = \{x;\ dp/dm > 0\}$ of measures p in $\mathcal{P}$. 
Those measures constitute a countable subfamily $\mathcal{P}'$ of $\mathcal{P}$, which is 
equivalent to $\mathcal{P}$. 


2) There exists a pivotal measure n, a convex combination of the 
measures in $\mathcal{P}'$. Each p in $\mathcal{P}$ has a density wrt. n. 


3) A subfield $\mathcal{B}$ is Sufficient if and only if the density dp/dn is 
$\mathcal{B}$-measurable for each p in $\mathcal{P}$ (Neyman Factorization Theorem). 


4) If a subfield includes another subfield which is Sufficient, then the 
former is also Sufficient. 


5) There exists the minimal Sufficient subfield, the smallest subfield wrt. 
which all the densities dp/dn, $p \in \mathcal{P}$, are measurable. 


The existence of the minimal Sufficient statistic is also proved under a 
slight additional restriction that $\mathcal{P}$ is separable wrt. the total variation distance 
(Lehmann & Scheffé, 1950). 

The term minimal Sufficient requires slightly technical clarifications. 
Burkholder (1961) proved that the following two properties of a Sufficient 
subfield B are equivalent to each other: 


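i) $\mathcal{B} \subset \mathcal{C}\ [\mathcal{P}]$ for every Sufficient subfield $\mathcal{C}$, and 
ii) $\mathcal{B} \subset \mathcal{C}\ [\mathcal{P}]$ for every Sufficient subfield $\mathcal{C}$ such that $\mathcal{C} \subset \mathcal{B}\ [\mathcal{P}]$. 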


B is called minimal Sufficient when it has these properties. On the other 
hand, a Sufficient statistic is called minimal if it is a function of every other 
Sufficient statistic except on a P-null set which may depend upon the latter 
statistic. 

This minimality does not coincide with the minimality of the Sufficient 
subfield which the statistic induces. As a result, the minimal Sufficient statistic 
and subfield may not coincide with each other even when both exist. In this 
connection, all logically possible kinds of counter examples are actually available 
(see Bahadur, 1955; and Landers & Rogge, 1972). 

In case Sufficiency is replaced by pairwise Sufficiency in i) and ii), then i) 
does not follow from ii), so that smallest pairwise Sufficiency and minimal 
pairwise Sufficiency have to be differentiated. 


Undominated Cases 


Thus in the dominated case Sufficiency exhibits all the good features to 
qualify itself for a mathematical embodiment of the statistical notion of 
sufficiency. However, it came to be known already around 1960 that some of the 
features are not carried over to the general case. Notably, general validity of 4) 
and 5) was disproved by the counter examples given by Burkholder (1961) for 4) 
and Pitcher (1957) for 5), respectively. The phenomena of the failure of 4) and 
5) are accordingly called Burkholder and Pitcher pathologies. 

Various intermediate conditions more general than domination have been 
devised in order to avoid these pathologies. Here we present two such conditions, 
namely, majorization and weak domination. The reader is referred to Luschgy & 
Mussmann (1985) for details of these and other conditions. 

An experiment E is called majorized if there exists a majorizing measure 
m on A wrt. which each p in P has a density dp/dm. 

E is called weakly dominated if the majorizing measure m is further 
assumed to be localizable (for the definition of localizability see Diepenbrock, 
1971, or Ghosh et al., 1981). 

The majorized case is more or less the most general case in which 
positive results are being obtained at present. The non-majorized cases are the 
places mainly for counter examples, but for some early, universal type of 
theorems by Bahadur (1954, 1955b), Burkholder (1961) and others. 

Weak domination is more general than domination, as localizability of a 
measure follows from $\sigma$-finiteness, and is equivalent to some other conditions 
which appeared in the literature, such as compactness (Pitcher, 1965), coherence 
(Hasegawa & Perlman, 1974), etc. 

There is a simple but suggestive special case of weak domination, called 
the discrete case. E is called discrete if X is an uncountable space, A is the power 
set, each p in P is a discrete probability and the only P-null set is the empty set. 
It is Professor D. Basu himself who pointed out with J.K. Ghosh (1967) that the 
problem of sampling from finite populations falls in this category and thus 
became one of the pioneers of the study of sufficiency in the undominated cases. 

These conditions have been only partly successful in removing the 
pathologies, insofar as the minimal Sufficient subfield was proved to exist in the 
weakly dominated case, but not in the majorized case in general (Pitcher, 1965, 
and Hasegawa and Perlman, 1974). Burkholder pathology persists even in the 
weakly dominated case. 

The reason for this difference between dominated and undominated cases 
becomes apparent if a parallelism to the passage from 1) through 5) is tried out 
for the majorized case. It follows that: 


1’) X is now covered by an uncountable family of almost disjoint kernels. 
This family is called a maximal decomposition (Diepenbrock, 1971). 
As before, the kernels are subsets of the supports of measures p in P. 
Those measures constitute an uncountable equivalent subfamily P’ of 
P. 


2’) A pivotal measure n can be defined as the sum of the measures in P’ 
restricted to the respective kernels. 


3’) A subfield B is pairwise Sufficient and contains the supports S(p) for 
all p in P (pairwise Sufficiency with supports, abbreviated as PSS), if 
and only if the density dp/dn is B-measurable for each p in P 
(analogue of the Neyman Factorization Theorem, Ramamoorthi & 
Yamada, 1982). 


4') If a subfield includes another subfield which is PSS, then the former is 
also PSS. 


5’) There exists the smallest subfield which is PSS (Ghosh et al., 1981). 


Thus, instead of Sufficiency in the dominated case, here we arrive at 
PSS, a property in between Sufficiency and pairwise Sufficiency. Notice further 
that the likelihood ratios are seen in 3’) to be functions of the sample through 
PSS rather than Sufficiency, which coincides with the former in the dominated 
case. 

On the other hand, if we insist upon retaining all the nice properties of 
Sufficiency, i.e. (B) through (D) as well as 4) and 5), we have to take to 
something even more restrictive than domination, as it would require a type of 
sample space with regular conditional probabilities. Barndorff-Nielsen (1978) 
points this out and puts forward one such framework: a Euclidean sample space 
with the Borel field and a dominated P, in which only B-sufficiency (defined in 
terms of the existence of a regular conditional probability P(A|T) common to all p 
in P) of statistics rather than subfields is treated. This would restrict us almost 
within the purview of Definition (A) and would mean little more than a return to 
Fisher’s old setup. 


Relationship Between the Definitions 


Much attention has been directed to the relationship between various 
definitions of sufficiency, especially on the question as to whether Sufficiency 
follows from other definitions. It is quite rightly so, as Sufficiency is defined 
solely in measure-theoretic terms and, unlike other definitions, is not directly 
concerned with specific statistical problems, though it was also originated in an 
estimation problem. It is relevant to ask whether the requirement for Sufficiency 
is just appropriately strong, or actually stronger than the requirements for other 
definitions. 

We take up this question as regards decision, Bayes and, in addition, test 
sufficiency, as it is often called in literature. Blackwell sufficiency would require 
some preliminaries from the comparison of experiments which is beyond our 
scope. 

A subfield B is called test sufficient if for any test function there exists a 
B-measurable test function whose expectation is identical with that of the former 
for all p in P. 

It was proved in a series of classical results in Bahadur (1955a, b), 
Blackwell (1951), Kudo (1967) and others that each of the four concepts 
including Blackwell sufficiency implies pairwise Sufficiency. In the case of 
decision sufficiency we need some clarification on the precise definition of a 
decision problem, but we will not go into the details. In the dominated case as 
pairwise Sufficiency implies Sufficiency, each of the four concepts implies 
Sufficiency. 

Things are again very different in the undominated case. First, pairwise 
Sufficiency does not imply Sufficiency in general. Secondly, the implication, e.g. 
Bayes sufficiency implies Sufficiency obviously fails in the face of Burkholder 
pathology, as Sufficiency implies Bayes sufficiency and a subfield including Bayes 
sufficient subfield is Bayes sufficient. So the implication needs to be modified to 
a weaker statement: A Bayes sufficient subfield includes a Sufficient subfield. 
The same modifications are made also in regard to decision, test and Blackwell 
sufficiency. 

This modified statement is proved by Ramamoorthi (1980) for decision 
sufficiency: A decision sufficient subfield includes at least one Sufficient subfield 
in it. 

The statement a test sufficient subfield includes a Sufficient subfield is 
obviously more difficult to follow, as test sufficiency is weaker than decision 
sufficiency. Indeed, since the paper of Brown (1975) which says that it holds true 
for the discrete case, little progress has been seen, but for a recent proof of PSS 
does not imply test sufficiency for the weakly dominated case by Kusama & Fujii 
(1987). Even this statement, not at all surprising, cannot be readily proved for 
more general cases. 

The questions concerning Bayes sufficiency are even more technical, as 
Bayes sufficiency involves a measurable structure on P, and appears to be weaker 
than all other definitions. In the extreme case, it is no more than pairwise 
Sufficiency if P has the discrete σ-field. It follows from test sufficiency if both X 
and P have countably generated σ-fields, and from decision sufficiency in the 
general case (Ramamoorthi, 1980. Incidentally Blackwell sufficiency also follows 
from decision sufficiency). On the other hand, a rather natural counter example, 
in which P is a standard Borel space, is available to show that Bayes sufficiency 
does not imply Sufficiency (Blackwell and Ramamoorthi, 1982). 

What, then, does Bayes sufficiency imply at all? Suppose A and a 
subfield B are countably generated. Then B is Bayes sufficient if and only if it is 
Sufficient for almost all p in P wrt. every prior measure on P (Ramamoorthi, 
1980). 


LeCam’s Framework of L-Space and M-Space 


An entirely different approach has been proposed by LeCam (1964) to 
bypass the difficulties discussed above by means of function spaces and further 
developed towards various directions (by e.g. Littaye-Petit, Piednoir & Van 
Cutsem, 1969; Siebert, 1979; Luschgy & Mussmann, 1985; and recently LeCam 
himself, 1986) through the seventies and eighties. 

Let E = (X, A, P) be an experiment. The band L(E) generated by P in 
the space of bounded signed measures on A is called the L-space of E. If E is 
majorized and n is a majorizing measure equivalent to P (shown to exist by 
Diepenbrock, 1971), then L(E) coincides with {f·n; f ∈ L_1(X, A, n)}, where f·n 
denotes the bounded signed measure having f as the density wrt. n. Assign the 
total variation topology to L(E), denote by M(E) its topological dual and call it 
the M-space of E. Sufficiency is now defined for a sublattice of M(E) as follows: 
A sublattice H is sufficient if there exists a positive linear projection π of M(E) 
onto H such that <p, πf> = <p, f> for all f in M(E) and p in P. 

It then follows that a sublattice including a sufficient sublattice is 
sufficient, and the smallest sufficient sublattice exists. Thus this sufficiency 
appears to be free from both Burkholder and Pitcher pathologies. 

Two concepts, transition and deficiency, play important parts in the 
theory. A transition is defined as a positive linear mapping from the L-space of 
an experiment F to that of another experiment E which preserves the norms of 
the positive elements. The deficiency of F to E is devised for measuring how 
much less informative F is than E when they share the same parameter space. 
Write E = (X, A, P), F = (Y, B, Q) with P = {p_θ; θ ∈ Θ} and Q = {q_θ; θ ∈ Θ}, 
where Θ is the common parameter space. The deficiency of F to E is defined by 


d(F, E) = inf { sup_θ ||T q_θ − p_θ|| ; T is a transition from F to E }. 


Now take a sublattice H of M(E). There is an experiment F whose M- 
space is H, provided H is closed. It is proved that H is sufficient if and only if 
the deficiency of F to the original experiment E is 0. 

Specialize these concepts to the case of E = (X, A, P) and F = (X, B, P) 
where B is a subfield of A. A transition from L(F) to L(E) is then a 
generalization of a stochastic kernel from (X, B) to (X, A) and “d(F, E) = 0” is 
a generalization of Blackwell sufficiency. Hence the foregoing paragraph is 
interpreted as sufficiency implies Blackwell sufficiency in the present context. 

Instead of starting from E = (X, A, P) and going to M(E) via L(E), it is 
also possible to take an abstract L-space as L and construct the whole theory 
directly based on it. Thereby P appears as a set of positive elements p with 
norm 1 in L, while X and A do not appear. This is more like what LeCam (1964) actually 
did. Here we have followed the way Siebert (1979) presented the theory. 

Torgersen (1979) undertakes a further generalization by including 
unbounded functions into M, and develops an estimation theory which has a 
theorem: Every estimable function admits a UMVU if and only if a 
quadratically complete sufficient statistic exists. 

Such abstract developments lend a highly refined appearance to the 
theory, though the departure from the basis of the sample space invites critical 
comments. 

It is not very easy to compare this theory to the measure theoretic 
treatment, as the concepts do not necessarily correspond to each other. When we 
try to locate a counterpart of a subfield B, it is found in M(E) in the form of the 
sublattice H(B), the totality of the B-measurable functions. Whether H(B) is a 
sufficient sublattice or not can be decided only when it happens to be closed, so 
as to admit the projection used to define the sufficiency of a sublattice. And in 
that event, the sublattice H(B) is sufficient in M(E) if and only if the subfield B 
is pairwise sufficient in E (Littaye-Petit et al., 1969). 

In this correspondence between B and H(B), no criterion inherent in 
M(E) is readily available to distinguish Sufficiency, PSS and pairwise Sufficiency 
of B on the basis of the properties of H(B) as a sublattice. So the sufficient 
sublattices correspond to these three kinds of subfields altogether. 

This suggests significance of pairwise Sufficiency, and in particular PSS, 
as something more than a mathematical tool. Remember that the role played by 
PSS in the majorized case is very similar to, if not as important as, that of 
Sufficiency in the dominated case. 

In the weakly dominated case, PSS possesses some more properties 
almost parallel to those of Sufficiency (Yamada, 1980). Suppose that B is PSS 
and f is an integrable function. Then there exists a function g which satisfies g = 
E_p[f | B] a.e. for each p in P, and falls only a little short of being B-measurable. 
In precise terms, g is measurable wrt. the strong completion of B, i.e. B ∨ N(P), 
on the support of each p in P, though on the whole space it is measurable only 
wrt. the weak completion ∩{B ∨ N(p); p ∈ P} (N(P) and N(p) mean the 
families of P-null and p-null sets, respectively). 

This property can then be used to prove analogues of test sufficiency and 
the Rao-Blackwell property for PSS, by providing an improved test and estimator 
which are close to being B-measurable. 

Further attempts have been made at extending these properties to the 
majorized experiments (Yamada, 1988). 


Basu Theorems 


This title refers to the two renowned theorems of Basu on independence of 
sufficient and ancillary statistics (see Basu, 1982). Because they 
connect such basic concepts as sufficiency, ancillarity, completeness and 
independence, related works still appear in the literature. We first state the 
theorems. Assume, unless otherwise stated, that T is a sufficient statistic. 


I. Suppose that T is boundedly complete. Then an ancillary statistic S 
is independent of T (for all p in P). 


II. Assume that there is no splitting set. Then a statistic S which is 
independent of T is ancillary. 


A splitting set is defined as “a set A such that p(A) = 1 for some p’s and 
0 for all other p’s in P” by Koehn & Thomas (1975). A slightly different 
condition to be assumed in II and some remarks on the conditions are found in 
Basu (1982) and Basu & Cheng (1981). Bayesian versions of these and related 
theorems are given in Basu & Pereira (1983). 
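
Theorem I can be illustrated numerically. The following is only a sketch under assumptions that are not part of the text: the model is taken to be n independent N(μ, 1) observations, T is the sample mean (sufficient and complete in this family) and S is the sample variance (ancillary). The function names are hypothetical. A near-zero sample correlation is consistent with the independence asserted by Theorem I but does not by itself prove it.

```python
import random


def simulate_mean_and_variance(mu, n=5, reps=20000, seed=0):
    """Return sample means (T) and sample variances (S) of `reps` samples of size n from N(mu, 1)."""
    rng = random.Random(seed)
    means, variances = [], []
    for _ in range(reps):
        xs = [rng.gauss(mu, 1.0) for _ in range(n)]
        m = sum(xs) / n
        means.append(m)
        variances.append(sum((x - m) ** 2 for x in xs) / (n - 1))
    return means, variances


def correlation(a, b):
    """Pearson correlation, written out to keep the sketch self-contained."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


if __name__ == "__main__":
    for mu in (0.0, 10.0):
        t, s = simulate_mean_and_variance(mu)
        # The average of S should not depend on mu (ancillarity), and the
        # correlation between T and S should be close to zero.
        print(mu, round(sum(s) / len(s), 3), round(correlation(t, s), 3))
```

Changing μ shifts the distribution of T but leaves that of S unchanged, which is the ancillarity used in the theorem.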

Recently Goossen (1986), while working on conditional completeness, 
made a remark that the assumption of sufficiency of T in I and II can be replaced 
by sufficiency of T for (S, T). 

Lehmann (1981) gives two theorems as adaptations of Basu’s theorems, 
aiming at characterizations of (bounded) completeness. Basu theorems as such 
are not exactly a characterization, as the independence of all the ancillaries from 
T does not imply bounded completeness of T unconditionally. The reason for 
this gap, Lehmann considers, lies in the difference between ancillarity and 
completeness in their nature, one being concerned with the whole distribution 
while the other only with the expectations. Notice the modifications accordingly 
made on each concept to bridge the gap in the theorems thus constructed: 


III. T is boundedly complete if and only if every bounded function of T is 
uncorrelated with every bounded first order ancillary (a statistic 
whose expectation is independent of p). 


IV. T is F_0-complete if and only if every ancillary is independent of T (F_0 
means the class of all functions f(T) such that f(T) = E[g | T] for some 
two valued function g. T is called F_0-complete if f ∈ F_0 and E_p(f(T)) 
= 0 for all p together imply f(T) = 0). 


Basu theorems are closely related to invariance theory, where conditions 
for sufficiency, ancillarity and mutual independence of an invariant and an 
equivariant statistic S are studied. 

A recent work of this kind is Eberl (1983), which deals with the n- 
dimensional location model. Let S be the maximal invariant and T be an 
equivariant statistic in this model. Neither sufficiency of T as such nor its 
bounded completeness is assumed. It follows that: 


V. T is independent of S if and only if T is invariantly sufficient (i.e. 
p(C|T) is independent of p for all invariant sets C). 

Similar questions are asked, and a considerable number of results have been 
obtained with remarkable applications in more general invariant models like 
compact or locally compact spaces and/or groups. As they cannot be detailed 
here, readers are referred to, e.g., Dasgupta (1979) and Ramamoorthi (1990) for 
such results and remarks on their connection with Basu theorems. 


FOUNDATIONS OF STATISTICAL QUALITY CONTROL 
Richard E. Barlow, University of California at Berkeley 


and 


Telba Z. Irony, University of California at Berkeley 


Abstract 


The origins of statistical quality control are first reviewed relative to the 
concept of statistical control. A recent Bayesian approach developed at AT&T 
laboratories for replacing Shewhart-type control charts is critiqued. Finally, a 
compound Kalman filter approach to an inventory problem, closely related to 
quality control and based on Bayesian decision analysis, is described and 
compared to other approaches. 


Statistical Control 


The control chart for industrial statistical quality control was invented 
by Dr. Walter A. Shewhart in 1924 and was the foundation for his Economic 
Control of Quality of Manufactured Product—his 1931 book. (A highly 
recommended recent reference is Deming, 1986.) On the basis of Shewhart’s 
industrial experience, he formulated several basic and important ideas. 
Recognizing that all production processes will show variation in product if 
measurements of quality are sufficiently precise, Shewhart described two sources 
of variation; namely 


i) variation due to chance causes (called common causes by Deming, 


1986); 


ii) variation due to assignable causes (called special causes by Deming, 


1986). 


Chance causes are inherent in the system of production while assignable 
causes, if they exist, can be traced to a particular machine, a particular worker, a 
particular material, etc. According to both Shewhart and Deming, if variation in 
product is only due to chance causes, then the process is said to be in statistical 
control. Nelson (1982) describes a process in statistical control as follows: “A 
process is said to have reached a state of statistical control when changes in 
measures of variability and location from one sampling period to the next are no 
greater than statistical theory would predict. That is, assignable causes of 
variation have been detected, identified, and eliminated.” Duncan (1974) 
describes chance variations: “If chance variations are ordered in time or possibly 
on some other basis, they will behave in a random manner. They will show no 
cycles or runs or any other defined pattern. No specific variation to come can be 
predicted from knowledge of past variations.” Duncan, in the last sentence, is 
implying statistical independence and not statistical control. 

Neither Shewhart nor Duncan has given us a mathematical definition of 
what it means for a process to be in statistical control. The following example 
shows that statistical independence depends on the knowledge of the observer 
and, therefore, we think it should not be a part of the definition of statistical 
control. 


Example 


The idea of chance causes apparently comes from or can be associated 
with Monte Carlo experiments. Suppose I go to a computer and generate n 
random quantities normally distributed with mean 0 and variance 1. Since I 
know the distribution used to generate the observed quantities, I would use a 
N(0,1) distribution to predict the (n+1)st quantity yet to be generated by the 
computer. For me, the process is random and the generated n random quantities 
provide no predictive information. However, suppose I show a plot of these n 
numbers to my friend and I tell her how the numbers were generated except that 
I neglect to tell her that the variance was 1. Then for her, x_{n+1} is not 
independent of the first n random quantities because she can use these n 
quantities to estimate the process variance and, therefore, better predict x_{n+1}. 

What is interesting from this example is that for one of us the 
observations are from an independent process while for the other the observations 
are from a dependent process. But of course (objectively) the plot looks exactly 
the same to both of us. The probability distribution used depends on the state of 
knowledge of the analyst. I think we both would agree however that the process 
is in statistical control. 
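
The two states of knowledge in this example can be mimicked by simulation. The sketch below is illustrative only and rests on an assumption that is not in the text: the friend's uncertainty about the variance is represented by a two-point prior on the standard deviation (0.5 or 2.0 with equal probability). All names are hypothetical. With the variance known, the squares of two observations are uncorrelated. With the variance unknown, they are positively correlated, so one observation carries information about the other.

```python
import random


def correlation(a, b):
    """Pearson correlation, written out to keep the sketch self-contained."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def squared_pair_correlation(draw_sigma, reps=40000, seed=1):
    """Correlation between x_1**2 and x_2**2 when both share one draw of sigma."""
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(reps):
        sigma = draw_sigma(rng)  # standard deviation used for this pair
        first.append(rng.gauss(0.0, sigma) ** 2)
        second.append(rng.gauss(0.0, sigma) ** 2)
    return correlation(first, second)


# My view: the variance is known to be 1.
known = squared_pair_correlation(lambda rng: 1.0)
# My friend's view (assumed prior): sigma is 0.5 or 2.0 with equal probability.
unknown = squared_pair_correlation(lambda rng: rng.choice([0.5, 2.0]))

if __name__ == "__main__":
    print(round(known, 3), round(unknown, 3))
```

In both cases the pair (x_1, x_2) is exchangeable; only the independence judgement differs.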

All authors seem to indicate that the concept of statistical control is 
somehow connected with probability theory although not with any specific 
probability model. We think de Finetti (1937, 1979) has given us the concept 
which provides the correct mathematical definition of statistical control. 


Definition: Statistical control 


We say that a production process is in statistical control with respect to 
some measurement variable, x, on units 1, 2, ..., n if and only if in our judgement 


p(x_1, x_2, ..., x_n) = p(x_{i_1}, x_{i_2}, ..., x_{i_n}) 


for all permutations {i_1, i_2, ..., i_n} of units {1, 2, ..., n}. That is, the units are 
exchangeable with respect to x in our opinion. This definition has two 
implications: namely that the order in which measurements are made is not 
important and, secondly, as a result, all marginal distributions are the same. It 
does not, however, imply that measurements are independent. 

In addition, the process remains in statistical control if, in our 
judgement, future units are exchangeable with past units relative to our 
measurement variable. 

The questions which concern all authors on quality control are: 


(1) How can we determine if a production process is in statistical control? 


and 


(2) Once we have determined that a production process is in statistical 
control, how can we detect a departure from statistical control if it 
occurs? 


The solution offered by most authors to both questions is to first plot the 
data. A plot of the measurements in time order is called a run chart. Run 
charts are also made of sample averages and sample ranges of equal sample sizes 
at successive time points. The grand mean is plotted and control limits are set 
on charts of sample averages and sample ranges. The process is judged to be in 
statistical control if 


i) there are no obvious trends, cycles or runs below or above the grand 
mean; 


ii) no sample average or sample range falls outside of control limits. 
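
Criterion ii) can be sketched in code as follows. This is only an illustration of the conventional rule just described, not a method endorsed by the authors. It assumes subgroups of size 5 and uses the usual tabulated Shewhart chart constants for that size (A2 = 0.577, D3 = 0, D4 = 2.114). The function names are hypothetical, and criterion i) on trends, cycles and runs is not implemented.

```python
# Tabulated Shewhart chart constants for subgroups of size 5 (assumed here).
A2, D3, D4 = 0.577, 0.0, 2.114


def control_limits(subgroups):
    """Control limits for sample averages and sample ranges of equal-size subgroups."""
    means = [sum(g) / len(g) for g in subgroups]
    ranges = [max(g) - min(g) for g in subgroups]
    grand_mean = sum(means) / len(means)
    mean_range = sum(ranges) / len(ranges)
    return {
        "average": (grand_mean - A2 * mean_range, grand_mean + A2 * mean_range),
        "range": (D3 * mean_range, D4 * mean_range),
    }


def flagged_subgroups(subgroups):
    """Indices of subgroups whose average or range falls outside the control limits."""
    limits = control_limits(subgroups)
    avg_lo, avg_hi = limits["average"]
    rng_lo, rng_hi = limits["range"]
    flagged = []
    for i, g in enumerate(subgroups):
        average, spread = sum(g) / len(g), max(g) - min(g)
        if not (avg_lo <= average <= avg_hi) or not (rng_lo <= spread <= rng_hi):
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    data = [[9, 10, 10, 10, 11]] * 10 + [[14, 15, 15, 15, 16]]
    print(control_limits(data), flagged_subgroups(data))
```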


Samples at any particular time are considered to constitute a rational 
sample (i.e., in our terminology, to be exchangeable with units not sampled at 
that time). The only question is that of exchangeability of rational samples over 
time. In practice, control limits are based on a probability model for the rational 
samples and all observed sample averages and ranges over time. 

The marginal probability model can, in certain cases, also be inferred 
from the judgement of exchangeability. If measurements are in terms of 
attributes; i.e., x_i = 1 (0) if the i-th unit is good (bad) and if the number of such 
measurements is conceptually unbounded, then it follows from de Finetti’s 
representation theorem that 


p(x_i = 1) = ∫_0^1 p(x_i = 1 | θ) p(θ) dθ = ∫_0^1 θ p(θ) dθ 


for some measure p(θ)dθ and further, that x_1, x_2, ..., x_n are conditionally 
independent given θ. In this case θ can be interpreted as the long run “chance” 
that a unit is good; i.e., (Σ x_i)/n tends to θ with subjective probability one as n 
increases without limit. Chance in this case is considered a parameter, not a 
probability. 
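
A small numerical sketch may make the representation concrete. It assumes, purely for illustration, that the mixing measure p(θ)dθ is a Beta(a, b) distribution, in which case the joint probability of a 0/1 sequence can be built up from one-step predictive probabilities. The function name and the Beta choice are assumptions, not part of the text. The joint probability is unchanged under permutation (exchangeability), yet the units are not independent.

```python
from itertools import permutations


def sequential_prob(xs, a=1.0, b=1.0):
    """Joint probability of the 0/1 sequence xs under a Beta(a, b) mixture of
    independent Bernoulli(theta) trials, built from predictive probabilities."""
    prob, good = 1.0, 0
    for i, x in enumerate(xs):
        # Predictive probability that the next unit is good, given the
        # `good` successes observed among the first i units.
        p_good = (a + good) / (a + b + i)
        prob *= p_good if x == 1 else 1.0 - p_good
        good += x
    return prob


if __name__ == "__main__":
    # Every ordering of two good units and one bad unit has the same probability.
    for order in sorted(set(permutations((1, 1, 0)))):
        print(order, sequential_prob(order))
    # Marginal p(x_1 = 1) versus predictive p(x_2 = 1 | x_1 = 1).
    print(sequential_prob((1,)), sequential_prob((1, 1)) / sequential_prob((1,)))
```

With a = b = 1 the marginal probability of a good unit is 1/2, while the predictive probability after one good unit is 2/3, so the measurements are exchangeable but dependent.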

In general, however, exchangeability alone is too weak to determine a 
probability model and additional judgements are required to determine marginal 
probability distributions. Let x_1, x_2, ..., x_n be exchangeable measurement errors. 
If, in addition, we judge measurement errors to be spherically symmetric; i.e., 
p(x_1, x_2, ..., x_n) is invariant under rotations of the vector (x_1, x_2, ..., x_n) and this for 
all n, then it follows that the joint probability function is a mixture of normal 
distributions and x_i given σ² is N(0, σ²) while x_1, x_2, ..., x_n given σ² are 
conditionally independent. Also (Σ x_i²)/n tends to a limit, σ², with subjective 
probability one. For more details see Dawid (1986). 

The problem of determining and justifying control limits remains. It was 
this problem which led Hoadley (1981) to develop his quality measurement plan 
critiqued in the next section. The usual method for computing control limits 
(e.g. Nelson, 1982) violates the likelihood principle. Basu (1988) has argued 
convincingly against such methods. 


A Critique of the Quality Measurement Plan 


A quality auditing method called the quality measurement plan or QMP 
was implemented throughout AT&T technologies in 1980 (see Hoadley, 1981). 
The QMP is a statistical method for analyzing discrete time series of quality 
audit data relative to the expected number of defects given standard quality. It 
contains three of the audit ingredients: defects assessment, quality rating and 
quality reporting. 

A quality audit is a system of inspections done continually on a sampling 
basis. Sampled product is inspected and defects are assessed whenever the 
product fails to meet engineering requirements. The results are combined into a 
rating period and compared to a quality standard which is a target value of 
defects per unit. It reflects a trade-off between manufacturing cost, operating 
costs and customer need. 

Suppose there are T rating periods: t = 1, ..., T (current period). For 
period t, we have the following data from the audit: 


n_t = audit sample size; 
x_t = number of defects in the audit sample; 
s = standard number of defects per unit; 
e_t = expected number of defects in the sample when the quality standard 
is met; e_t = s n_t; 
I_t = x_t / e_t = quality index (measure of the defect rate). 


I_t is the defect rate in units of standard defect rate. For instance, if I_t = 2, it 
means that twice as many defects as expected have been observed. 

The statistical model used in QMP is a version of the Empirical Bayes 
model. The assumptions are the following: 


1. x_t has a Poisson distribution with mean n_t λ_t, i.e., (x_t | λ_t) ~ Poi(n_t λ_t), 
where λ_t is the true defect rate per unit in time period t. If λ_t is 
reparametrized on a quality index scale, the result is: 


θ_t = λ_t / s = true quality index. 


In other words, θ_t = 1 is the standard value. Therefore, we can write: 
(x_t | θ_t) ~ Poi(e_t θ_t). 


2. For each rating period t, there is a true quality index θ_t; θ_t, t = 
1, ..., T, is a random sample from a Gamma distribution with mean θ 
and variance γ². θ is called the process average and γ² is called the 
process variance. We can write (θ_t | θ, γ²) ~ Gamma(θ²/γ², θ/γ²). In 
this model, both θ and γ² are unknown. 


3. θ and γ² have a joint prior distribution p(θ, γ²). 


The parameter of interest is θ_T given the past data, d_{T-1}, and current 
data, x_T. Here d_t = (x_1, x_2, ..., x_t) and d_0 is a constant. 

The model assumes that the process average, θ, although unknown, is 
fixed; i.e., the model assumes exchangeability. In reality θ may be changing. In 
order to handle this, the QMP procedure uses a moving window of six periods of 
data. 
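
The generative side of the QMP model can be sketched as a simulation. The sketch fixes the process average θ and process variance γ² at given values instead of drawing them from a prior, which is an assumption made for illustration; the function names are hypothetical. The Gamma distribution with mean θ and variance γ² has shape θ²/γ² and rate θ/γ², which corresponds to scale γ²/θ in Python's `random.gammavariate`.

```python
import math
import random


def draw_poisson(rng, mean):
    """Poisson draw by Knuth's multiplication method; adequate for small means."""
    limit, count, product = math.exp(-mean), 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_qmp(theta, gamma2, sample_sizes, s, seed=0):
    """Simulate audit data: theta_t ~ Gamma(mean theta, variance gamma2),
    x_t ~ Poisson(e_t * theta_t) with e_t = s * n_t, and I_t = x_t / e_t."""
    rng = random.Random(seed)
    shape, scale = theta ** 2 / gamma2, gamma2 / theta
    periods = []
    for n_t in sample_sizes:
        theta_t = rng.gammavariate(shape, scale)  # true quality index for period t
        e_t = s * n_t                             # expected defects at standard quality
        x_t = draw_poisson(rng, e_t * theta_t)    # observed defects
        periods.append({"theta_t": theta_t, "e_t": e_t, "x_t": x_t, "I_t": x_t / e_t})
    return periods


if __name__ == "__main__":
    for period in simulate_qmp(theta=1.0, gamma2=0.25, sample_sizes=[100] * 6, s=0.1):
        print(period)
```

Averaged over many periods, the quality index I_t is close to θ, since its expectation given θ and γ² equals θ.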

A suitable way to describe and to analyze the QMP model is via 
probabilistic influence diagrams. Probabilistic influence diagrams have been 
described by Shachter (1986) and Barlow and Pereira (1990). 

A probabilistic influence diagram is a special kind of graph used to model 
uncertain quantities and the probabilistic dependence among them. It is a 
network with directed arcs and no directed cycles. Circular nodes (probabilistic 
nodes) represent random quantities and arcs into random quantities indicate 
probabilistic dependence. An influence diagram emphasizes the relationships 
among the random quantities involved in the problem and represents a complete 
probabilistic description of the model. The solution for the QMP model, i.e., the 
posterior distribution of θ_T given the past data, d_{T-1}, and current data, x_T, can be 
achieved through the use of influence diagram operations, namely, node merging, 
node splitting, node elimination and arc reversal. These operations are described 
in Barlow and Pereira (1990). Figure 1 is an influence diagram representation 
corresponding to the QMP model. 

The joint distribution for random quantities in the QMP model is 
completely defined by the influence diagram in Figure 1. The absence of arrows into 
node (θ, γ²) means that we start with the unconditional joint distribution of θ 
and γ². The arrows originating at node (θ, γ²) and ending at nodes θ_t (t = 
1, ..., T) indicate that the distributions of θ_t are conditional on θ and γ². This 
means that the process is considered exchangeable, that is, the process average, θ, 
is constant over time. Finally, each node x_t is the sink of an arrow starting at 
node θ_t, meaning that the distribution of the random quantity x_t is conditional on 
θ_t for each t = 1, ..., T. 


Exchangeability assumption: 
(θ_t | θ, γ²) ~ G(θ²/γ², θ/γ²), t = 1, ..., T 


(x_t | θ_t) ~ Poi(e_t θ_t) 


Figure 1 


The QMP chart is a control chart for analyzing defect rates. Quality 
rating in QMP is based on posterior probabilities given the audit data. It 
provides statistical inference for the true quality process. Under QMP, a box and 
whisker plot (Figure 2) is plotted each period. The box plot is a graphical 
representation of the posterior distribution of θ_T given d_{T-1} = (x_1, ..., x_{T-1}) and 
x_T. The standard quality on the quality index scale is one. Two means twice as 
many defects as expected under the standard. Hence, the larger the quality 
index, the worse the process. 

The posterior probability that the true quality index is less than the top 
whisker (I_99%) is 99%. The top of the box (I_95%), the bottom of the box (I_5%) 
and the bottom whisker (I_1%) correspond to probabilities of 95%, 5% and 1%, 
respectively. 

The x is the observed value in the current sample, the heavy dot is the 
Bayes estimate of θ and the dash is the Bayes estimate of the current quality 
index (θ_T), a weighted average between x_T and θ. 


In a complete QMP chart (with all boxes), the dots are joined to show 
trends, i.e., it is assumed implicitly that the quality index θ_t may be changing 
from period to period. 


1. Exception reporting 


The objective of quality rating is to give a specific rule that defines 
quality exceptions and a measure (e.g., probability) associated with an exception. 
For QMP there are two kinds of exceptions: 


a. A rating class is Below Normal (BN) if I_1% > 1, i.e., if P(θ_T > 1) 
> 99%. 


b. A rating class is on Alert if I_1% < 1 < I_5%, i.e., if 95% < 
P(θ_T > 1) < 99%. (See Figure 3.) 


Products that meet these conditions are highlighted in an exception 
report. 
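
The two exception rules can be written as a short function. This is a minimal sketch: the function name and string labels are hypothetical, and the inputs are assumed to be the 1% and 5% posterior percentiles of the current quality index. The text states the rules with strict inequalities, so boundary cases are treated here as normal.

```python
def rating_class(i_01, i_05):
    """Classify a rating class from the 1% and 5% posterior percentiles of theta_T.

    i_01 is the bottom whisker and i_05 the bottom of the box in the QMP box plot.
    """
    if i_01 > 1.0:
        # P(theta_T > 1) > 99%
        return "below normal"
    if i_01 < 1.0 < i_05:
        # 95% < P(theta_T > 1) < 99%
        return "alert"
    return "normal"


if __name__ == "__main__":
    for whisker, box_bottom in [(1.2, 1.5), (0.9, 1.1), (0.7, 0.9)]:
        print(whisker, box_bottom, rating_class(whisker, box_bottom))
```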

[Figure 3: QMP box plots illustrating the rating classes normal, alert, and below normal.] 


2. Posterior distribution of current quality 


In order to get the exact solution for QMP, we have to compute the 
posterior distribution of θ_T given d_{T-1} = (x_1, ..., x_{T-1}) and x_T. Hoadley (1981) 
describes a complicated mathematical “solution” to this model. It can be best 
understood through the following sequence of influence diagrams: 

Figure 4: sequence of influence diagrams for the QMP solution. 


Diagram 1: Starting model: (θ, γ²) ~ p(θ, γ²); 
(θ_t | θ, γ²) ~ Gamma(θ²/γ², θ/γ²) for t = 1, ..., T; (x_t | θ_t) ~ Poi(e_t θ_t). 


Diagram 2: Nodes θ_1, ..., θ_T are eliminated through integration. 
(x_t | θ, γ²) ~ Negative Binomial (Aitchison and Dunsmore, 1975). 


Diagram 3: Nodes x_1, x_2, ..., x_T are merged, i.e., the joint distribution of d_T = 
(x_1, x_2, ..., x_T) is computed. (d_T | θ, γ²) ~ ∏_{t=1}^{T} Negative Binomials. 


Diagram 4: The arc that goes from node (θ, γ²) to node d_T is reversed, i.e., 
Bayes theorem is used to compute the posterior distribution of θ 
and γ² given d_T. p(θ, γ² | d_T): posterior for θ and γ² given d_T. 


Diagram 5: Node d_T is split. The joint distribution of (x_1, x_2, ..., x_T) is written 
as the distribution of d_{T-1} = (x_1, x_2, ..., x_{T-1}) and the conditional 
distribution of x_T given d_{T-1}. 


Diagram 6: Node θ_T is added again into the model. The distribution of θ_T 
given (θ, γ²) and x_T is determined. 


Diagram 7: Node (θ, γ²) is eliminated through integration. 
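
The elimination step in Diagram 2 can be checked numerically. With shape a = θ²/γ² and rate b = θ/γ², integrating the Poisson probability of x_t against the Gamma density of θ_t gives a negative binomial probability, Γ(x_t + a) / (Γ(a) x_t!) · (b/(b + e_t))^a · (e_t/(b + e_t))^{x_t}. The sketch below compares this closed form with a midpoint-rule integral. The function names, the integration range and the step count are assumptions chosen for the illustrative parameter values.

```python
import math


def negative_binomial_pmf(x, theta, gamma2, e_t):
    """Closed-form marginal probability of x_t given theta and gamma2."""
    shape, rate = theta ** 2 / gamma2, theta / gamma2
    log_p = (math.lgamma(x + shape) - math.lgamma(shape) - math.lgamma(x + 1)
             + shape * math.log(rate / (rate + e_t))
             + x * math.log(e_t / (rate + e_t)))
    return math.exp(log_p)


def mixture_pmf(x, theta, gamma2, e_t, upper=20.0, steps=20000):
    """Midpoint-rule integral of Poi(x; e_t * t) * Gamma(t; shape, rate) over t in (0, upper).

    `upper` is assumed to lie well beyond the bulk of the integrand.
    """
    shape, rate = theta ** 2 / gamma2, theta / gamma2
    h = upper / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        log_poisson = -e_t * t + x * math.log(e_t * t) - math.lgamma(x + 1)
        log_gamma = (shape * math.log(rate) + (shape - 1) * math.log(t)
                     - rate * t - math.lgamma(shape))
        total += math.exp(log_poisson + log_gamma)
    return total * h


if __name__ == "__main__":
    for x in (0, 5, 12):
        print(x, negative_binomial_pmf(x, 1.0, 0.25, 10.0), mixture_pmf(x, 1.0, 0.25, 10.0))
```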


As we can see from the diagrams, the quality indexes θ_1, ..., θ_T are 
eliminated in order to compute the distribution of the data, d_T, given θ and γ² 
and then, to compute the posterior distribution of θ and γ² given the data, d_T. 
Nevertheless, the parameter of interest is the current quality index, θ_T, which has 
to be re-introduced into the influence diagrams. This procedure is not correct. 
According to this, x_T is influencing θ_T twice in diagram 6. On one hand, directly 
(there is an arrow from x_T to θ_T), and on the other hand, through the posterior 
distribution of θ and γ² given d_T. In other words, node θ_T is eliminated (in 
influence diagram 2) and is added again (in influence diagram 6) and this is not 
the way one should solve an inference problem. 

Even if this procedure were correct, the posterior distribution of θ_T would 
be a complex triple integral depending on the prior distribution assessed for θ and 
γ². This integral would have to be inverted in order to compute the QMP box 
chart. In other words, the exact solution is mathematically intractable, especially 
when many rating classes have to be analyzed each period. The result is a 
complicated algorithm (Hoadley, 1981) that computes all the parameters that are 
needed in order to construct the Gamma distribution for θ_T | d_T. Hoadley’s model 
assumes exchangeability, i.e., statistical control. Hence it does not provide an 
alternative to statistical control which can be used to decide whether or not the 
process is still in statistical control at the current time period. In the absence of 
an alternative model to exchangeability a better solution would have been to 
simply plot the standardized likelihoods (gamma densities) for θ_t at each time 
period based on the Poisson model. This would implicitly assume the θ_t’s 
independent a priori. 


A Kalman Filter Model for Inventory Control 


As we have seen, the problem of quality control is to determine if and 
when a process has gone out of statistical control. The main difficulty with 
classical quality control procedures and also with the QMP model is that the 
models used assume the process is in statistical control and consider no 
alternative models to this situation. For coherent decision making, it is necessary 
to determine logical alternative models corresponding to a process out of 
statistical control. 

In a paper dealing with inventory control (Barlow, Durst and Smiriga, 
1984), a Kalman filter model was discussed from a decision theory point of view 
which could also be used for quality control problems. The paper describes an 
integrated decision procedure for deciding whether a diversion of Special Nuclear 
Material (SNM) has occurred. The problem is especially relevant for statistical
analysis because it concerns (a priori) low probability events which would have
high consequence if any occur. Two possible types of diversion are considered: a 
block loss during a single time period and a cumulative trickle loss over several 
time periods. The methodology used is based on a compound Kalman filter 
model. 

Perhaps the simplest Kalman filter model is 


$$
\begin{aligned}
y(t) &= \theta(t) + u(t), \\
\theta(t) &= \theta(t-1) + w_\theta(t),
\end{aligned}
\tag{1}
$$

where $y(t)$ is the measured inventory at time period $t$ and $\theta(t)$ is the actual
inventory level. Our uncertainty with respect to measuring error is modeled by
$u(t)$, while $w_\theta(t)$ models our uncertainty about the difference in the actual
amounts processed between time periods $t-1$ and $t$.

The $y(t)$ process will be in statistical control in the sense of the first
section if and only if $w_\theta(t) = 0$ for all $t$. For the inventory problem it seems
reasonable to use (1) to model the process in the absence of any diversions. Later 
we will extend this model to account for possible diversions. 
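
To make the updating implied by model (1) concrete, the following is a minimal sketch of the scalar Kalman recursion, assuming normal errors with known variances. The function names and the arguments `var_u` (measurement-error variance) and `var_w` (process-noise variance) are our own illustrative choices and do not appear in the paper cited above.

```python
def kalman_step(mean, var, y, var_u, var_w):
    """One update for y(t) = theta(t) + u(t), theta(t) = theta(t-1) + w(t).

    mean, var : posterior mean and variance of theta(t-1)
    y         : new measurement y(t)
    var_u     : measurement-error variance (assumed known)
    var_w     : process-noise variance (assumed known)
    """
    # Predict theta(t) from data up to time t-1.
    prior_mean = mean
    prior_var = var + var_w
    # Combine the prediction with the new measurement.
    gain = prior_var / (prior_var + var_u)
    post_mean = prior_mean + gain * (y - prior_mean)
    post_var = (1.0 - gain) * prior_var
    return post_mean, post_var


def kalman_filter(ys, mean0, var0, var_u, var_w):
    """Run kalman_step over a sequence of measurements."""
    mean, var = mean0, var0
    path = []
    for y in ys:
        mean, var = kalman_step(mean, var, y, var_u, var_w)
        path.append((mean, var))
    return path
```

Setting `var_w = 0` corresponds to the case of statistical control, where the recursion reduces to ordinary normal updating for a fixed inventory level.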

The compound Kalman filter model allows a decision maker to decide at 
each time period whether the data indicate a diversion. A block loss, by 
definition, will be a substantial amount which, it is hoped, will be detected at the 
end of the period in which it occurs. A trickle loss, on the other hand, is a 
smaller amount which is not expected to be detected in a single occurrence. A 
trickle loss may consist of a diversion or process holdup (or both), while a block 
loss is always a diversion. Two models are given for the process during each time 
period; in one, a block loss is assumed to have occurred, while in the other, only 
the usual trickle loss takes place. Since there are two models at each time period,
a fully Bayesian analysis would require $2^n$ models at the end of $n$ time periods,
which is computationally untenable. A simple approximation is made which
rests on the assumption that a block loss is a low-probability event. With this
approximation only two models need be considered at each period, with all
inference conditional on the assumption of no block loss in past periods (which
has probability virtually equal to 1 as long as we have never come close to
deciding that a block loss has occurred). By comparing these two models, we
decide whether a block loss has occurred, and if we decide that it has, an
investigation is initiated. Since trickle loss, at least in the form of process
holdup, is always assumed to occur, we will never decide that no trickle loss has 
occurred. We will either decide that a trickle diversion has occurred over several 
past periods, or we will decide that we as yet are unconvinced that a trickle loss 
beyond the normal holdup has occurred. 

In Figure 5, $\beta(1), \beta(2), \ldots$ denote the amounts of possible but
unknown block losses during their respective time periods. The amounts of
possible but unknown trickle losses are denoted by $\tau(1), \tau(2), \ldots$. In our
approach, we shall have two models: one model for block loss, say $M_B$, and one
model for trickle loss, say $M_T$. We believe that model $M_B$ holds with probability
$p(M_B)$ and model $M_T$ with probability $1 - p(M_B)$. Given data $D$, $p(M_B \mid D)$ is our
updated probability for the block loss model $M_B$. If our updated probability for
the block loss model is too high, then we will decide to investigate the possibility
of a block loss. A decision regarding possible trickle loss, on the other hand, is
based on the probability that loss beyond the normally expected holdup has
occurred over several time periods; i.e.


$$P\{\tau(1) + \cdots + \tau(t) > c \mid D\},$$

where $c$ is the normally expected holdup over $t$ time periods. Thus, as indicated
in Figure 5, our decision sequence is the customary one; at each time period we
either decide that a substantial block loss has occurred in the most recent period,
that an unusually large trickle loss has been occurring in the past few periods, or
that no block loss is likely to have occurred and that trickle loss is within
acceptable limits. Our decision procedure does not formally permit the conclusion
that a block loss has occurred other than within the most recent period, but it is
shown that certain trickle alarms indicate the presence of an undetected block loss
in some past period.

Figure 5. Diagram of possible decision sequences relative to diversion of special
nuclear material. Branch labels at each period: "Block loss: stop and investigate";
"No block loss, possible trickle loss".
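
The updated probability $p(M_B \mid D)$ described above is an application of Bayes' theorem to the two competing models. The sketch below assumes, purely for illustration, that each model supplies a normal one-step predictive distribution for the new measurement; the function names and the numerical values in the usage note are hypothetical.

```python
import math


def normal_pdf(y, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def block_model_probability(prior_block, y, pred_trickle, pred_block):
    """Posterior probability of the block-loss model after observing y.

    prior_block  : prior probability p(M_B)
    pred_trickle : (mean, variance) of the predictive for y under M_T
    pred_block   : (mean, variance) of the predictive for y under M_B
    """
    like_block = normal_pdf(y, *pred_block)
    like_trickle = normal_pdf(y, *pred_trickle)
    numerator = prior_block * like_block
    return numerator / (numerator + (1.0 - prior_block) * like_trickle)
```

For example, with a prior of 0.01 on a block loss, a measurement of 90 when the trickle model predicts 100 (variance 4) and the block model predicts 90 (variance 9) moves the probability of the block-loss model close to 1, whereas identical predictive distributions leave the prior unchanged.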

In order to clearly illustrate the salient features of these models, consider
the simplified model (1) with only one measurement each period. At time $t$, $\theta(t)$
is the quantity of interest, but we can only observe $y(t)$. We assume that all
variables in (1) are normally distributed.

The simplified trickle model $M_T$ is:

$$
\begin{aligned}
y(t) &= \theta(t) + u(t), \\
\theta(t) &= \theta(t-1) - \tau(t) + w_\theta(t), \\
\tau(t) &= \tau(t-1) + w_\tau(t).
\end{aligned}
\tag{2}
$$

The simplified Kalman filter block model $M_B$ is:

$$
\begin{aligned}
y(t) &= \theta(t) - \beta(t) + u(t), \\
\theta(t) &= \theta(t-1) - \tau(t) + w_\theta(t), \\
\tau(t) &= \tau(t-1) + w_\tau(t).
\end{aligned}
\tag{3}
$$

For the $M_B$ model, assume that $\beta(t)$ is also normally distributed.

The values of distribution parameters, even in our simplest model, must 
be carefully set. Too little initial uncertainty about possible trickle loss may 
make the model surprisingly unresponsive to large unexpected losses. A set of 
distribution parameters can be entirely self-consistent, seem on casual inspection 
quite sensible, and still produce undesirable behavior of the detection procedure. 
Thus distribution parameters should not be set arbitrarily or casually, but only 
after a careful assessment of process and loss uncertainties which takes into 
account the effect of the parameters on the resulting decision procedure. 

The compound Kalman filter model provides a detection process which 
can compete with currently popular methods. Large block losses are detected 
handily, while somewhat smaller block losses are often detected later by the 
trickle model. Trickle losses consistently in excess of the expected holdup are
detected rapidly, and smaller trickle losses are detected as the total amount of 
trickle loss becomes large. 

With standard quality control methods, decisions must be made with a 
test of fixed significance level; otherwise, the frequentist interpretation of the test 
does not hold. Since we are dealing with probability distributions, we are not 
limited to setting a critical threshold and a critical probability. In fact, 
simulations indicate that it is best to take into account all the information given 
by the posterior probabilities. The results of a single hypothesis test, although a 
convenient summary, may be misleading. The user of these methods is 
encouraged to examine the probabilities of multiple critical regions, something 
which is not possible with standard quality control methods.


PREQUENTIAL DATA ANALYSIS


A.P. Dawid, University College London, Department 
of Statistical Science, England 


Abstract 


The basic theory of the prequential approach to data analysis is 
described, and illustrated by means of both simulation experiments and 
applications to real data-sets. 


Introduction 


The prequential approach to the problems of theoretical statistics was 
introduced by Dawid (1984). It is based on the idea that statistical methods 
should be assessed by means of the validity of the predictions that flow from 
them, and that such assessments can usefully be extracted from a sequence of 
realized data-values, by forming, at each intermediate time-point, a forecast for 
the next value, based on an analysis of earlier values. The main emphasis is on 
probability forecasting, requiring that one describe current uncertainty about the 
predictand by means of a fully specified probability distribution. However, point 
forecasts, or other forms of prediction, can also be accommodated. 

The purpose of the above paper was to indicate the fertility of the 
prequential point of view for furthering understanding of traditional concerns of 
theoretical statistics, such as consistency and efficiency. However, the prequential 
approach is essentially data-analytic. As such, it is particularly well suited to 
empirical investigation of the structure and properties of real-world observations, 
and their sources. In this paper, we shall discuss some of the ways in which 
prequential assessment may be applied in practical problems, including goodness- 
of-fit, model choice and density estimation. These methods are illustrated, by 
means of simulation experiments and applications to real data. 


Prequential Assessment 


Let $Y = (Y_1, Y_2, \ldots)$ be a potentially infinite sequence of observables,
and $Y^{(k)} = (Y_1, Y_2, \ldots, Y_k)$. We consider methods of forming, for each
$k$, a prediction for $Y_k$, based on past data $Y^{(k-1)} = y^{(k-1)}$; or, more
generally, of deciding on an action $a_k$ on the basis of $y^{(k-1)}$, when subject to a
loss $L_k(y, a)$ if $Y_k = y$ and $a_k = a$. Such a method $M$ having been applied for $k =
1$ to $n$, and resulting in actions $(a_1, a_2, \ldots, a_n)$, its performance might be assessed
by means of its total prequential loss

$$L_n^*(M) = \sum_{k=1}^{n} L_k(y_k, a_k),$$

which measures the success of its earlier forecasts; and comparison amongst
methods on this basis provides a guide (albeit imperfect) as to their likely relative
future performance.
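
As an illustration of the total prequential loss, the sketch below scores each observation against an action chosen from the earlier observations only. The running-mean forecast, the squared-error loss and all function names are assumptions made for the example, not choices taken from the text.

```python
def prequential_loss(ys, forecast, loss):
    """Total prequential loss: y_k is scored against an action
    computed from y_1, ..., y_{k-1} only."""
    total = 0.0
    for k, y in enumerate(ys):
        action = forecast(ys[:k])  # past data only
        total += loss(y, action)
    return total


def running_mean(past, default=0.0):
    """Point forecast: mean of the past values (default before any data)."""
    return sum(past) / len(past) if past else default


def squared_error(y, a):
    """Loss incurred when y is observed after taking action a."""
    return (y - a) ** 2
```

For the sequence 1, 3, 2 the successive forecasts are 0, 1 and 2, so the losses are 1, 4 and 0 and the total prequential loss is 5.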

Starting from a parametric family of such methods, $\mathcal{M} = \{M_\theta : \theta \in \Theta\}$,
with $M_\theta$ specifying $a_k = a_k(y^{(k-1)}; \theta)$, each $\theta$-value is thus assessed by

$$L_n^*(\theta) = \sum_{k=1}^{n} L_k\bigl(y_k, a_k(y^{(k-1)}; \theta)\bigr).$$

The optimizing strategy $\hat{M}$ based on $\mathcal{M}$ then uses, for selection of $a_{n+1}$, $M_{\hat\theta_n}$,
where $\hat\theta_n$ minimizes $L_n^*(\theta)$ ($n = 0, 1, \ldots$; modification for small $n$ may be
required). This itself needs to be assessed by its prequential loss

$$L_n^*(\hat{M}) = \sum_{k=1}^{n} L_k\bigl(y_k, a_k(y^{(k-1)}; \hat\theta_{k-1})\bigr),$$

which will typically exceed $L_n^*(\hat\theta_n)$.

Prequential assessment of past predictive performance is very close in 
spirit to the method of cross-validation (Stone, 1974) but bases its prediction for
$Y_k$ on all previous outcomes, rather than on all outcomes distinct from $Y_k$. In
both methods, the intention is to avoid the bias involved in letting $Y_k$ contribute
to its own prediction, and so to produce an honest assessment of uncertainty.


Probability Forecasting 


One way of choosing the action $a_k$, after observing $Y^{(k-1)} = y^{(k-1)}$, is to
specify a predictive distribution $P_k$ for $Y_k$ and to choose $a_k$ to minimize the
predictive expected loss

$$\int L_k(y_k, a)\, dP_k(y_k).$$

Specification of such a sequence of predictive distributions $(P_k)$, for any data,
constitutes a probability forecasting system (PFS), and is equivalent to choosing a
joint distribution $P$ for the sequence $Y$. Under broad regularity conditions, it
then follows that, with $P$-probability 1,

$$\limsup_{n \to \infty}\, \bigl(L_n^*(M) - L_n^*(M')\bigr) < \infty,$$

where $M$ is given by the above method, and $M'$ is an arbitrary method. Thus if Nature
is regarded as generating $Y$ from $P$, then using $P$ as a PFS to construct an action
sequence will be optimal, for any loss function.

A PFS $P$ for $Y$, or its associated sequence $(P_k)$ of predictive distributions
of $Y_k$ given $Y^{(k-1)} = y^{(k-1)}$, can be assessed directly if we take the action $a_k$ to be
the choice of a distribution $Q_k$ for $Y_k$, and use a proper scoring rule $S_k(y, Q_k)$; i.e.
such that, for any distribution $P_k$ for $Y_k$, $E_{P_k}[S_k(Y_k, Q_k)]$ is minimized in $Q_k$
when $Q_k = P_k$ (Dawid, 1986). Then the optimal sequence of actions is just the
sequence $(P_k)$. The assessment becomes particularly simple if we use the
logarithmic scoring rule $S_k(y, P_k) = -\log f_k(y)$, $f_k$ being the density of $P_k$. We
then obtain $L_n^*(P) = -\log f(y^{(n)})$, $f$ being the implied joint density for $Y^{(n)}$ under
$P$. That is, we can, and henceforth shall, assess and compare PFS's by means of
their prequential log-likelihoods.

It is interesting to note that, if the distributions $P$ and $Q$ for $Y$ are
mutually absolutely continuous, then $L_n^*(P) - L_n^*(Q)$ will (with probability 1
under either $P$ or $Q$) remain bounded, and may oscillate between positive and
negative values. In this case we shall never achieve an ultimate preference for
either PFS, and it seems that we remain forever in a quandary as to which to use
for further forecasts. However, a result of Blackwell and Dubins (1962) shows
that, in this case, the forecasts produced by $P$ and $Q$ will be asymptotically
indistinguishable, so that the choice is unimportant. This is an instance of
Jeffreys's Law (Dawid, 1984): observationally indistinguishable statistical
approaches must be in essential agreement on their assertions about observables.

If $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ is a parametric family of PFS's, with predictive
densities $f_k(y_k; \theta)$, the optimizing strategy $\hat{P}$ based on $\mathcal{P}$ describes $Y_{n+1}$ as having
density $f_{n+1}(y_{n+1}; \hat\theta_n)$, $\hat\theta_n$ being the maximum likelihood estimator based on data
$y^{(n)}$. The success of this plug-in MLE strategy must itself, however, be judged by
means of its own prequential log-likelihood, viz.

$$\log \prod_{i=1}^{n} f_i(y_i; \hat\theta_{i-1}),$$

rather than

$$\log \prod_{i=1}^{n} f_i(y_i; \hat\theta_n).$$
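
The distinction between these two quantities can be seen in a small sketch. We assume, for illustration only, independent normal observations with unknown mean and unit variance, so that the maximum likelihood estimate is the sample mean; the starting value used before any data are seen is an arbitrary assumption, as are the function names.

```python
import math


def normal_logpdf(y, mean, var=1.0):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (y - mean) ** 2 / var


def prequential_loglik(ys, start_mean=0.0):
    """Sum of log f_i(y_i; theta_hat_{i-1}): each y_i is scored with
    the estimate based on y_1, ..., y_{i-1} only."""
    total, running_sum = 0.0, 0.0
    for i, y in enumerate(ys):
        mean = running_sum / i if i > 0 else start_mean  # assumed start value
        total += normal_logpdf(y, mean)
        running_sum += y
    return total


def maximized_loglik(ys):
    """Sum of log f_i(y_i; theta_hat_n): every y_i is scored with the
    estimate based on all n observations, including y_i itself."""
    mean = sum(ys) / len(ys)
    return sum(normal_logpdf(y, mean) for y in ys)
```

For the data 1, 3, 2 the maximized log-likelihood exceeds the prequential one, since each observation has been allowed to contribute to its own prediction.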


Similarly we can judge any other such statistical forecasting system (SFS), based
on the same model or on another. A SFS might involve plugging-in some
estimate of $\theta$ from past data, as above; Bayesian or fiducial elimination of $\theta$; or
any other suitable (standard or ad hoc) procedure. However, any such strategy
will itself always be describable as a PFS, and hence as a joint distribution for $Y$.
This allows standard probability theory to be applied in theoretical studies of the
performance of a SFS for data generated from $P_\theta \in \mathcal{P}$, and opens up a fresh
approach to the traditional problems of statistical theory (Dawid, 1984). In
general, (efficient-estimate) plug-in and Bayesian SFS's are asymptotically
optimal. The latter yield prequential likelihoods expressible in the form
$\int f(y^{(n)}; \theta)\,\pi(\theta)\,d\theta$, which has computational advantages, as well as being
insensitive to reordering of the data.
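
Both properties of the Bayesian prequential likelihood can be checked in a simple conjugate case. The sketch below assumes 0-1 data with a Beta(a, b) prior on the success probability, an example of our own choosing; the product of the one-step-ahead predictive probabilities then equals the closed-form marginal likelihood, whatever the order of the data.

```python
from math import lgamma, log


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def bayes_prequential_loglik(ys, a=1.0, b=1.0):
    """Sum of log one-step-ahead predictive probabilities for 0-1 data."""
    total, ones, n = 0.0, 0, 0
    for y in ys:
        p_one = (a + ones) / (a + b + n)  # predictive P(Y = 1 | past data)
        total += log(p_one if y == 1 else 1.0 - p_one)
        ones += y
        n += 1
    return total


def log_marginal_likelihood(ys, a=1.0, b=1.0):
    """Log of the integral of f(y; theta) pi(theta) d theta, in closed form."""
    s, n = sum(ys), len(ys)
    return log_beta(a + s, b + n - s) - log_beta(a, b)
```

With a uniform prior, the sequences 1, 0, 1, 1 and 1, 1, 1, 0 both have prequential likelihood 0.05, which is also the value of the marginal likelihood.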


Empirical Assessment 


Sometimes an absolute assessment is required as to whether a PFS $P$
adequately describes data $y$. If the $Y_i$ are continuous real variables, and $F_i$
denotes the distribution function of $Y_i$ under $P_i$, then $U = (U_1, U_2, \ldots)$, where $U_i
= F_i(Y_i)$, should be independently uniform on [0,1] if $Y$ arises from $P$, and so a
variety of tests can be based on the observed values $u$. To assess uniformity, we
might examine the u-plot, i.e. the empirical c.d.f. of the $u$'s, which should be
close to the line of unit slope. This could be tested formally using, say, the
Kolmogorov-Smirnov statistic. One should also inspect the $(u_i)$ for any sign of
non-independence, trend, or dependence on omitted variables. A simple indicator
of trend is provided by the uniform conditional test (Cox and Lewis, 1966) or y-plot,
which forms the empirical c.d.f. of $(y_i)$, where $y_i = \sum_{j=1}^{i} x_j \big/ \sum_{j=1}^{n} x_j$ with
$x_i = -\log(1 - u_i)$. These $y$'s are uniform order-statistics under $P$, and this can again be
tested formally.
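
The two diagnostics can be computed directly from the sequence of u-values. The sketch below is a minimal illustration with names of our own choosing; it takes the y-plot values to be the normalized partial sums of $-\log(1 - u_i)$, which is our reading of the uniform conditional test, and returns only the Kolmogorov-Smirnov distance, leaving the reference to tables to the reader.

```python
import math


def ks_uniform(us):
    """Kolmogorov-Smirnov distance between the empirical c.d.f. of the
    u-values and the uniform c.d.f. on [0, 1] (the line of unit slope)."""
    us = sorted(us)
    n = len(us)
    return max(max((i + 1) / n - u, u - i / n) for i, u in enumerate(us))


def y_plot_values(us):
    """Normalized partial sums of x_i = -log(1 - u_i), in time order.

    The last ratio is always 1 and is therefore omitted."""
    xs = [-math.log(1.0 - u) for u in us]
    total = sum(xs)
    values, running = [], 0.0
    for x in xs[:-1]:
        running += x
        values.append(running / total)
    return values
```

A trend in the forecasting errors shows up as y-plot values that drift away from uniformity, so `ks_uniform(y_plot_values(us))` gives a simple numerical summary of the y-plot.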

If the $Y_i$ are 0-1 variables, we can form calibration plots in which, for
various $\pi \in [0, 1]$, the observed relative frequency of $Y_i = 1$ over the set of
occasions having $\Pi_i \approx \pi$ (where $\Pi_i = P_i(Y_i = 1)$) is plotted against $\pi$. This
should give an approximate 45° line. More formally, we can construct test-statistics
such as $Z = \sum(Y_i - \Pi_i) \big/ \bigl[\sum \Pi_i(1 - \Pi_i)\bigr]^{1/2}$, the sum possibly being
restricted to a suitable subset of the data. Under very weak conditions, not
requiring independence, $Z$ and similar standardized statistics will be
asymptotically standard normal under $P$ (Seillier and Dawid, 1987) and
independent of statistics based on disjoint subsets. An observed value $z$ can thus be
referred to standard normal tables, or a sum of squares of $z$'s based on $k$ disjoint
subsets to chi-square tables with $k$ degrees of freedom.
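
The statistic Z requires only the outcomes and their probability forecasts. A minimal sketch follows; the function name and the small example are ours.

```python
import math


def calibration_z(outcomes, probs):
    """Z = sum(Y_i - Pi_i) / sqrt(sum(Pi_i * (1 - Pi_i))) for 0-1 outcomes
    and their one-step-ahead probability forecasts."""
    numerator = sum(y - p for y, p in zip(outcomes, probs))
    variance = sum(p * (1.0 - p) for p in probs)
    return numerator / math.sqrt(variance)
```

For instance, four forecasts of 0.5 followed by outcomes 1, 0, 1, 1 give a numerator of 1 and a variance of 1, hence Z = 1.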

It is noteworthy that all the methods described above are applicable 
given only the two sequences, of outcomes and of their probability forecasts, and 
make no reference to the structure of P over outcomes not observed. This is in 
accord with the Prequential Principle (Dawid, 1984). 

If $P$ is itself constructed as a SFS based on a parametric model $\mathcal{P} =
\{P_\theta\}$, it turns out, again under mild conditions, that the asymptotic distributions
of the test-statistics considered above continue to hold under any $P_\theta \in \mathcal{P}$
(Seillier et al., 1988). Consequently, these methods can be used to test the
overall goodness-of-fit of a parametric model.

If the distribution or model being used fails to describe the data, it may
be possible to massage it to provide a better fit. Thus suppose that the $(u_i)$
above look like a random sample, but from a non-uniform distribution. This
distribution could itself be estimated, either parametrically or nonparametrically
(as in Density estimation below). If the estimate based on $u^{(n)}$ is $G_n$, then $Y_{n+1}$
could be forecast by requiring that $F_{n+1}(Y_{n+1})$ has distribution $G_n$, rather than
uniform. Alternatively, serial correlation, or other suspected structure, in the $(u_i)$
could be estimated and allowed for. In the (0-1) case, if previous occasions on
which the same probability forecast as $p_{n+1}$ was issued had resulted in a
proportion $q$ of 1's, then $p_{n+1}$ might be replaced by $q$. Such adaptive
recalibration methods can improve the performance of a badly chosen initial
model, although there can be no guarantee that they will, since the recalibration
is based on the past but applied to the future.


Model Choice


Given a choice between two competing models, say $\mathcal{P} = \{P_\theta\}$ and $\mathcal{Q} =
\{Q_\phi\}$, we can first replace each of these by an appropriate SFS, say $P$ and $Q$,
respectively. We might then optimize the choice between these at each time-point.
Thus if it were $P$, say, rather than $Q$, that gave the larger prequential
likelihood (or smaller total prequential loss) to the data $y^{(k)}$ at time $k$, the
probability forecast for $Y_{k+1}$ would be that based on $P$. Of course, such a two-stage
optimization strategy needs assessing afresh in its own right. The method
extends to more stages, and to an arbitrary collection of models at each stage,
but clearly less trust can be placed in prequential analyses iterated to more
stages: even though the prequential approach avoids obvious bias at each stage,
no finite set of data can support more than a certain amount of investigation
without throwing up misleading messages.

In place of repeated optimization, one can take a Bayesian approach,
assigning prior weights $\alpha$ and $1 - \alpha$ to $P$ and $Q$. After observing $y^{(k)}$, with
prequential joint density $f(y^{(k)})$ under $P$ and $g(y^{(k)})$ under $Q$, $\alpha$ is replaced by

$$\alpha_k = \frac{\alpha f(y^{(k)})}{\alpha f(y^{(k)}) + (1 - \alpha)\, g(y^{(k)})},$$

and the forecast density for $Y_{k+1}$ is then the mixture $\alpha_k f_{k+1} + (1 - \alpha_k)\, g_{k+1}$.
The overall prequential likelihood for this strategy is simply
$\alpha f(y^{(n)}) + (1 - \alpha)\, g(y^{(n)})$. Again the method extends simply to
more models and more stages.
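
The Bayesian updating of the weights can be sketched as follows. The inputs are assumed to be the one-step-ahead predictive densities that each forecasting system assigned to the values actually observed; the function name and the numbers in the usage note are illustrative only.

```python
def bayes_mixture_path(alpha, f_preds, g_preds):
    """Sequentially update the weight on P and accumulate the prequential
    likelihood of the mixture strategy.

    f_preds[k], g_preds[k] : predictive densities of the (k+1)-th observed
    value under P and Q respectively.
    """
    weight = alpha
    f_joint = g_joint = mix_likelihood = 1.0
    for f, g in zip(f_preds, g_preds):
        # The mixture forecast density at the observed value, using the current weight.
        mix_likelihood *= weight * f + (1.0 - weight) * g
        f_joint *= f
        g_joint *= g
        # Posterior weight on P after this observation.
        weight = alpha * f_joint / (alpha * f_joint + (1.0 - alpha) * g_joint)
    return weight, mix_likelihood
```

With equal prior weights and predictive densities (0.8, 0.6) under P and (0.2, 0.9) under Q, the accumulated likelihood is 0.33, which equals 0.5 x 0.48 + 0.5 x 0.18 as the closed form requires. For long sequences the products should be accumulated on the log scale to avoid underflow.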

If one has a finite or countable collection of alternative models, and the 
data arise from some distribution in one of these, either of the above methods 
will be consistent and asymptotically optimal, in the sense that their forecasts 
will tend to those given by the true distribution, and at the fastest possible rate. 
However, for finite data-sets, the forecasts under the two methods may look 
rather different. In either case, if the true distribution is contained in a model of
high dimensionality, early analysis will generally tend to favor incorrect models
of low dimensionality. This is intuitively sensible, since, early on, the
mis-modelling bias may well be less of a problem than the imprecision involved in
trying to estimate many parameters.

As an alternative to allowing such transient behavior to be entirely data- 
driven, as above, one might build it in directly, by setting out with a strategy for 
choosing, at each stage, the complexity of the model to be fitted and how it is to 
be used for prediction. Different strategies, all yielding consistent estimates of 
the true model (and which use each fixed model efficiently) will all be 
asymptotically equally good. However, their transient behaviors, which may be 
long-lasting, can be very different, with some yielding much larger prequential 
log-likelihoods (or, more generally, much smaller prequential losses) than others,
even though these discrepancies will be bounded as the sample size goes to
infinity. More empirical and theoretical work is needed to indicate good forms
for such strategies. A sensible super-strategy could be built up from a
low-dimensional parametrized family of such strategies, using optimizing or Bayesian
methods. This could combine good transient behavior with sensitivity to the
data and avoidance of data-mining.


Non-parametric Approximation 


Many non-parametric problems, such as density estimation or fitting a 
stationary time-series, can be approached through a sequence of finitely 
parametrized methods, such as fitting histogram or kernel density estimates with 
adjustable bin width, or autoregressive models of various finite orders. One can 
then apply the techniques of the previous section, even though none of the models 
used is now expected to contain the distribution generating the data. The 
component models will generally each be characterized by some quantity, such as 
kernel width (w) or autoregressive order (p), which controls the balance between 
over-fitting (tracking noise in the data) and over-smoothing (not picking up the 
signal). Prequential choice of such a quantity will start out with a preference for
smoothing (large $w$, small $p$), and then, as the data-sequence grows longer and
can support more detailed modelling, gradually move towards fitting the past
data more and more closely ($w \to 0$, $p \to \infty$). Such a method will often be
prequentially consistent for a wide range of generating distributions, and can 
provide sensible answers based on finite data-sets, by making the predictively 
optimal compromise between fitting and smoothing. 

Investigation of the structure of good strategies, for choosing the model 
to fit at each stage, is still more vital in this context, since the behavior described 
as transient in the previous section now extends to infinity! Again, much further 
empirical and theoretical work is required to illuminate this problem area. 


Simulations 


1. Time-series modelling. 


Autoregressive models of varying order $k$ ($0 \le k \le 8$) were fitted to
several simulated time-series of 500 observations, and their prequential 
likelihoods calculated using both optimization (plugging-in current least-squares 
estimates) and Bayesian methods (using a non-informative prior), always 
excluding the first 15 observations. Results were as follows. 


(i) Independent standard normal variates: $Y_t = \epsilon_t$; Prequential Log-Likelihoods

k            :      0       1       2       3       4       5       6       7       8
Optimization : -715.5  -717.1  -721.0  -724.0  -728.0  -728.7  -735.1  -739.1  -740.5
Bayes        : -712.8  -713.9  -717.6  -719.4  -722.0  -722.4  -726.0  -728.4  -730.1


The strategy of optimizing over k chose k = 0 at all points, except one,
beyond the 57th observation, and chose k = 1 at all the exceptional points. This
strategy itself had a prequential log-likelihood of -714, better than that for any
fixed k.

The Bayes strategy (using equal prior probabilities) finished by assigning
probability 0.75 to k = 0 and 0.25 to k = 1. Its prequential log-likelihood too
was -714.


(ii) Autoregression: $Y_t = 0.1Y_{t-1} - 0.3Y_{t-2} + 0.2Y_{t-3} + \epsilon_t$; Prequential Log-Likelihoods

k            :      0       1       2       3       4       5       6       7       8
Optimization : -723.9  -725.3  -705.5  -700.1  -701.2  -701.3  -704.4  -708.6  -709.3
Bayes        : -722.8  -724.1  -703.5  -698.3  -699.1  -700.6  -702.7  -705.9  -707.6


There is a clear preference for the true order, with under-fitting being
more heavily penalized than over-fitting. Optimizing over k chose k = 2 up to
observation 40, k = 3 thereafter. This strategy had a prequential log-likelihood 
of -700, indistinguishable from that of k = 3. The Bayes strategy ended by 
assigning probability 0.63 to k = 3, 0.29 to k = 4 and 0.07 to k = 5, and itself 
had a prequential log-likelihood of -700. 


(iii) Moving average: $Y_t = 0.5\epsilon_t - 0.2\epsilon_{t-1}$; Prequential Log-Likelihoods

k            :      0       1       2       3       4       5       6       7       8
Optimization : -370.1  -354.8  -355.3  -357.7  -355.5  -356.9  -358.7  -360.3  -364.5
Bayes        : -368.9  -353.4  -353.2  -355.7  -353.2  -355.3  -357.1  -359.1  -362.2


The true process can be expressed as an infinite-order autoregression:
$Y_t = -0.4Y_{t-1} - 0.16Y_{t-2} - 0.064Y_{t-3} - \cdots + 0.5\epsilon_t$. The optimal autoregressive fit
to 500 observations, however, gave k = 1 (optimization) or k = 2 (Bayes), closely
followed by k = 4 (for which the estimated coefficient of lag 4 was -0.139,
compared with the true value of -0.026). Optimizing over k gave k = 1 at all
points, except for observations 16 to 33 (for which k was 0) and most points
between observations 460 and 486 (with k = 4). This strategy had prequential
log-likelihood of -355.5. The Bayes strategy assigned probabilities 0.27 to k = 1,
0.33 to k = 2, 0.03 to k = 3, 0.32 to k = 4, and 0.04 to k = 5, and itself had a
prequential log-likelihood of -354.3.
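
The autoregressive coefficients quoted above follow from inverting the moving-average operator. As a check, the short sketch below computes them for a first-order moving average with coefficients 0.5 and -0.2; the function name and argument names are ours.

```python
def ma1_ar_coefficients(theta0, theta1, n_lags):
    """For Y_t = theta0 * e_t + theta1 * e_{t-1}, return c_1, ..., c_n in
    Y_t = sum_j c_j * Y_{t-j} + theta0 * e_t (requires |theta1 / theta0| < 1)."""
    ratio = theta1 / theta0
    # Inverting (1 + ratio * B) gives the series sum_j (-ratio)^j B^j.
    return [-((-ratio) ** j) for j in range(1, n_lags + 1)]
```

Calling `ma1_ar_coefficients(0.5, -0.2, 4)` gives -0.4, -0.16, -0.064 and -0.0256, the last of which is the lag-4 value quoted as -0.026.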


2. Density estimation. 

Simple histogram-type density estimators were constructed from data- 
values in [0,1], based on a division of the unit interval into k equal sub-intervals. 
For each initial sub-sequence of data, the current density estimate was used to 
forecast the next observation. This was repeated for $1 \le k \le K$.
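
A prequential histogram estimator of this kind might be sketched as follows. The text does not say how empty sub-intervals were handled, so the pseudo-count added to every sub-interval is purely our assumption, as are the function and argument names.

```python
import math


def histogram_prequential_loglik(xs, k, pseudo=1.0):
    """Prequential log-likelihood of a k-bin histogram density on [0, 1].

    Each observation is scored with the density estimated from the
    earlier observations only. `pseudo` is an assumed smoothing count
    that keeps the estimated density positive in empty bins.
    """
    counts = [0] * k
    total = 0.0
    for n, x in enumerate(xs):
        b = min(int(x * k), k - 1)  # bin index; x = 1 goes in the last bin
        prob = (counts[b] + pseudo) / (n + pseudo * k)
        total += math.log(prob * k)  # density = bin probability / bin width
        counts[b] += 1
    return total
```

With k = 1 the estimated density is identically 1, so the prequential log-likelihood is 0 for any data in [0, 1].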


(i) A random sample of size 1000 from the uniform distribution on [0,1] yielded
the following overall prequential log-likelihoods (up to K = 10):

k              :  1     2     3     4      5      6      7      8      9     10
log-likelihood :  0  -3.7  -6.4  -8.8  -12.5  -16.0  -18.8  -22.3  -25.1  -32.4


The deterioration in performance when fitting more intervals than needed 
(viz. 1) is clear. 

The optimizing strategy, formed by selecting, at each point, that value 
for k yielding the highest prequential likelihood to date, always chose k = 1, 
except at a number of points up to the 52nd observation, for which k = 2 was 
chosen. 


(ii) A random sample of size 3000 was generated from the symmetric unimodal
density

$$f(x) = \tfrac{1}{2}\pi \sin(\pi x) \qquad (0 < x < 1).$$


With K = 20, the optimal k based on all the data was 14, the prequential log- 
likelihoods for k = 10 to 15 being, respectively, 399.3, 389.2, 401.3, 396.9, 402.2 
and 399.0. When optimizing over k at all points, the first and last appearances 
of various values, and their frequencies, were: 


k          :    1     2     3     4     5     6     7
First used :    1     6    21    77   229   148   319
Last used  :   20    43   154   188   388   395   842
Frequency  :   18     6   109    16    46   182    39

k          :    8     9    10    11    12    13    14   >14
First used : 1423   400  1457     -  1435     -  2856     -
Last used  : 1423  2133  2335     -  2915     -  3000     -
Frequency  :    1  1118   434     0   892     0   139     0


The general message of the above simulations would seem to be that, 
even for large data sets, it is generally far more effective to fit a very simple 
model that is approximately true, rather than one which contains the true 
distribution (or comes close to doing so), but is of highish dimension. 


Applications 
1. Weather forecasting. 


Jain (1983) analyzed a 53-year sequence of daily precipitation records
from Morogoro, Tanzania, as discussed in Stern and Coe (1984). The model P
for the conditional probability $p_t$ of rain on day $t$ (coded as $Y_t = 1$), given past
outcomes, was a non-stationary two-state second-order generalized linear Markov
chain:

[equation illegible in source]

where $t' = 2\pi t/366$, $i$ and $j$ are the outcomes of days $t - 2$ and $t - 1$, and $\theta$
consists of the $a$'s and $b$'s. The parameters were estimated recursively, with
initial estimates fitted, using maximum likelihood, to the first 700 data-points,
and the probability forecasts $p_t$ of the resulting plug-in strategy compared with
the actual outcomes ($t > 700$). Calibration plots and test-statistics were
constructed for various subsets of the data, corresponding to the months of the
year, and to specified outcomes of the three previous days. Table I gives, for
each month, the overall proportion $\bar{y}$ of rainy days, and the average forecast
probability $\bar{p}$. The final line gives values of the test statistic

$$Z_B = \frac{\sum_t (y_t - p_t)^2 - \sum_t p_t(1 - p_t)}{\bigl[\sum_t p_t(1 - p_t)(1 - 2p_t)^2\bigr]^{1/2}}$$

for assessing departure from expectation of the within-month Brier Score
$\sum_t (y_t - p_t)^2$. These should be approximately independent standard normal
variables under the model P. The combined chi-square of 78 on 12 degrees of
freedom clearly indicates poor model fit, and closer scrutiny reveals that the
model is noticeably under-forecasting rain in April, and when the third previous
day was wet.


TABLE I
Month     :    J     F     M     A     M     J     J     A     S     O     N     D
$\bar{y}$ :  .21   .22   .33   .54   .31   .10   .06   .04   .09   .10   .17   .22
$\bar{p}$ :  .20   .20   .36   .47   .31   .09   .05   .04   .09   .09   .16   .20
$Z_B$     : 1.57  4.68  2.15  5.00  1.80  2.19  2.26  0.40  0.42  0.66  1.26  2.93
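
The standardized Brier score and the combined chi-square can be reproduced with a few lines. The sketch below is illustrative: the function name is ours, and the monthly values are those read from Table I, with the February entry taken to be 4.68.

```python
import math


def brier_z(outcomes, probs):
    """Standardized Brier score for 0-1 outcomes and probability forecasts.

    Under the forecasting model, (y - p)^2 has mean p(1 - p) and
    variance p(1 - p)(1 - 2p)^2. Forecasts all equal to 0.5 give zero
    variance, in which case the statistic is undefined.
    """
    score = sum((y - p) ** 2 for y, p in zip(outcomes, probs))
    expected = sum(p * (1.0 - p) for p in probs)
    variance = sum(p * (1.0 - p) * (1.0 - 2.0 * p) ** 2 for p in probs)
    return (score - expected) / math.sqrt(variance)


# Monthly statistics as read from Table I (January to December).
z_b = [1.57, 4.68, 2.15, 5.00, 1.80, 2.19, 2.26, 0.40, 0.42, 0.66, 1.26, 2.93]
chi_square = sum(z * z for z in z_b)  # about 78.08 on 12 degrees of freedom
```

The sum of squares agrees with the combined chi-square of 78 quoted in the text, with February and April contributing well over half of the total.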


2. Medical diagnosis. 


Seillier (1982) analyzed 58 cases of jaundice, caused either by hepatitis 
(Y = 1) or by cirrhosis (Y = 0). Various logistic models to discriminate between 
the two diagnoses were considered, using regressor variables chosen from a set of 
ten symptoms (A, B, C, D, E, F, X1, X2, X3, X4) and a location indicator Q. 




Each model was fitted by maximum likelihood to the first $k$ cases
($k = 30, 31, \ldots, 57$), and used to provide a probability forecast $p_{k+1}$ for $Y_{k+1}$,
based on its associated regressor variables. The assessment of each model was
then based on its overall Brier score $\sum (y_k - p_k)^2$. The results are shown in
Table II, which also gives $\bar{p}$, for comparison with $\bar{y} = 0.29$.


TABLE II
Variables                                              Brier Score   $\bar{p}$
A + B + C + D + E + F + X1 + X2 + X3 + X4 + Q              4.7         0.25
A + B + C + D + E + F + X1 + X2 + X3 + X4                  3.8         0.36
A + C + D + E + X1 + X2 + X3 + X4 + Q                      4.6         0.24
A + C + D + E + X1 + X2 + X3 + X4                          3.8         0.36
A + D + E + X1 + X2 + X3 + X4 + Q                          4.0         0.22
A + D + E + X1 + X2 + X3 + X4                              3.4         0.39
A + D + X1 + X2 + X3 + X4 + Q                              4.6         0.32
A + D + X1 + X2 + X3 + X4                                  3.6         0.38
A + D + X1 + X3 + X4 + Q                                   3.0         0.23
A + D + X1 + X3 + X4                                       2.3         0.31
A + D + X1 + X3 + Q                                        3.4         0.24
A + D + X1 + X3                                            3.0         0.33
D + X1 + X3 + Q                                            3.0         0.25
D + X1 + X3                                                2.9         0.39
D + X1 + Q                                                 4.8         0.24
D + X1                                                     4.3         0.37
D + Q                                                      5.3         0.22
D                                                          5.1         0.36


Fitting all variables leads to poor predictions on this size data-set, as 
does fitting only two or three. The most successful model, as measured by its
Brier score, is A + D + X1 + X3 + X4, which also has $\bar{p}$ closest to $\bar{y}$. It is of
interest that, for any collection of symptom variables, adding in the location
indicator Q leads to worse predictions. This offers some empirical support for the
arguments of Dawid (1976) that suitable diagnostic models should be robust over
a range of locations.


3. Educational scaling. 


Opie (1983) conducted an analysis to see whether items in an educational 
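testing item-bank fitted the Rasch model, under which P(student i gets item j
correct) = $e^{\theta_i - \beta_j} / (1 + e^{\theta_i - \beta_j})$, with $\theta_i$ an ability parameter for student i
and $\beta_j$ a difficulty parameter for item j. The data-set contained responses to 60 test
items from 150 students. At an intermediate stage, a number of items, 1 to k - 1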
say, have been accepted, and item k is under test. For m = 75 to 150, the 
parameters are estimated (by maximum likelihood) from the responses of 
students 1 to m on items 1 to k, omitting that of student m on item k. The 
fitted probability for this omitted response can then be calculated, and the 
process repeated with m increased by 1. Comparison of these forecast 
probabilities with the actual responses (where these were not missing) then allows 
assessment of the fit of item k to the model. 

For testing item 60, with all other items included, the probabilities were 
grouped into 8 intervals, with counts, average probability and relative frequency 
of a right answer as given in Table III. 


TABLE III
                                 Average                Relative
Group (g)     Count ($n_g$)   probability ($\pi_g$)   frequency ($r_g$)
0.0  - 0.1        14              0.07                  0.07
0.1  - 0.15       16              0.12                  0
0.15 - 0.2        11              0.17                  0.09
0.2  - 0.3        12              0.25                  0.25
0.3  - 0.4         8              0.33                  0
0.4  - 0.5         6              0.44                  0
0.5  - 0.6         4              0.55                  0.5
0.6  - 1.0         4              0.86                  0.5


If the item fits the model, then

$$\sum_g \frac{n_g (r_g - \pi_g)^2}{\pi_g (1 - \pi_g)},$$

where $n_g$, $\pi_g$ and $r_g$ are the count, average probability and relative frequency
for group g, should be
approximately distributed as chi-square with 8 degrees of freedom. The observed
value of 15.8 is significant at 5%, suggesting a failure of calibration on this item,
and thus its non-conformity with the Rasch model.
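
As a rough check, the calibration statistic can be recomputed from the grouped values printed in Table III. The sketch below assumes the statistic is the sum over groups of n_g (r_g - pi_g)^2 / (pi_g (1 - pi_g)); the names `TABLE_III` and `calibration_chi_square` are illustrative. Because the table entries are rounded to two decimals, the result differs slightly from the 15.8 reported in the text.

```python
# (count n_g, average forecast probability pi_g, relative frequency r_g)
# for the eight probability groups of Table III.
TABLE_III = [
    (14, 0.07, 0.07),
    (16, 0.12, 0.00),
    (11, 0.17, 0.09),
    (12, 0.25, 0.25),
    (8, 0.33, 0.00),
    (6, 0.44, 0.00),
    (4, 0.55, 0.50),
    (4, 0.86, 0.50),
]

def calibration_chi_square(groups):
    """Sum of squared standardized deviations of observed from forecast frequency."""
    return sum(n * (r - pi) ** 2 / (pi * (1 - pi)) for n, pi, r in groups)

statistic = calibration_chi_square(TABLE_III)  # about 15.7 from the rounded entries
```

The upper 5% point of chi-square on 8 degrees of freedom is about 15.5, so the recomputed value leads to the same conclusion as the text.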


4. Software reliability. 


Littlewood et al. (1986) have made a thorough comparison of a number 
of model-based prediction systems for prequential probability forecasting of the 
successive inter-failure times of complex software systems. The data comprised 
136 inter-failure times ranging between 0 and 6150 seconds, and the models used 
all incorporated reliability growth (improved performance after each bug-fix). 
Some forecasting systems used optimization, some were Bayesian, others com- 
bined the two methods. The results are summarized in Table IV. 


TABLE IV
              u-plot K-S distance    y-plot K-S distance
System            (sig. level)           (sig. level)
 1. JM             .190 (1%)              .120 (NS)
 2. BJM            .170 (1%)              .116 (NS)
 3. GO             .153 (2%)              .125 (10%)
 4. L              .109 (NS)              .069 (NS)
 5. BL             .119 (NS)              .075 (NS)
 6. LNHPP          .081 (NS)              .064 (NS)
 7. LV             .144 (5%)              .110 (NS)
 8. KL             .138 (5%)              .109 (NS)
 9. W              .075 (NS)              .075 (NS)
10. D              .159 (2%)              .093 (NS)


Systems 1, 2 and 3 are all based on essentially the same model, as are 4, 
5 and 6. It appears that the method of data analysis is less important here than 
choosing a good model. Measured by prequential likelihood, the optimal system 
was 6. The authors also considered adaptive recalibration of the above systems, 
as well as Bayesian and optimizing strategies for combining them, leading in all 
cases to improvements in performance. 
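
The K-S distances in Table IV can be illustrated with a short sketch. It assumes the usual construction of the u-plot, in which each u_i is the one-step-ahead predictive distribution function evaluated at the observed inter-failure time, so that good forecasts give approximately uniform u_i; the function name is ours and the data behind Table IV are not reproduced.

```python
def ks_distance_from_uniform(u):
    """Kolmogorov-Smirnov distance between the empirical cdf of u and Uniform(0, 1)."""
    u_sorted = sorted(u)
    n = len(u_sorted)
    # Largest gap just after each jump of the empirical cdf ...
    d_plus = max((i + 1) / n - ui for i, ui in enumerate(u_sorted))
    # ... and just before each jump.
    d_minus = max(ui - i / n for i, ui in enumerate(u_sorted))
    return max(d_plus, d_minus)
```

For example, `ks_distance_from_uniform([(i + 0.5) / 10 for i in range(10)])` is 0.05, the smallest value attainable with ten points.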
Conclusion


The prequential method is broad in range, simple in concept, and based 
on a firm theoretical foundation. However its implementation leaves plenty of 
scope for variations, and is currently more art than science. Further work should 
lead to an improved understanding, and give guidance on good strategies of 
applying the method. Efficient computational methods or approximations will 
also be essential for routine application. 
Introduction


Problems of statistical inference with an infinite dimensional parameter 
space, usually a space of probability distributions over a set, are of great 
importance both theoretically and practically. The Bayesian approach to such 
nonparametric problems requires that a probability distribution be placed over 
this space. Much progress has been made in the past 15 years and the results 
have been scattered throughout the statistical and probability literature. It is the 
purpose of this paper to review the progress in this area to date with special 
emphasis on random probability measures and on results that have appeared 
since the review article of Ferguson (1974). 

The central class of distributions for use in these problems is the class of 
Dirichlet processes. Developments in the basic theory of such processes are 
reviewed in the next section. The settling of Doksum’s conjecture by James and 
Mosimann is observed in the third section on tailfree and neutral processes. 
Progress in the application of mixtures of Dirichlet processes to the Bayesian 
analysis of empirical Bayes problems, bio-assay and density estimation is pre- 
sented in the fourth section. The far-reaching extension of the basic techniques to 
problems with partially censored data is reviewed in the fifth section, with 
application to reliability and the Cox proportional hazard model. The use of 
random distributions in empirical Bayes estimation, initiated by Hollander and 
Korwar, has been extensively developed and is reviewed in the sixth section. In 
the seventh section, the problems of inconsistency of the Bayes estimates in 
Dalal’s symmetric Dirichlet model, discovered by Diaconis and Freedman, are 
presented. In the final section, various other Bayesian nonparametric techniques 
and applications are briefly touched upon. 


The Dirichlet Process 

Let $\mathcal{X}$ be a set, let $\mathcal{A}$ be a $\sigma$-field of subsets of $\mathcal{X}$, and let $\alpha$ be a finite
nonnull measure on $(\mathcal{X}, \mathcal{A})$. Among the various methods for putting prior
distributions on the set of all probability distributions over $(\mathcal{X}, \mathcal{A})$, the Dirichlet
process is still central. As defined in Ferguson (1973), a Dirichlet process with
parameter $\alpha$, denoted $\mathcal{D}(\alpha)$, is a random process, P, indexed by elements of $\mathcal{A}$
with the property that for all positive integers k, and every measurable partition
$A_1, \ldots, A_k$ of $\mathcal{X}$, the random vector $(P(A_1), \ldots, P(A_k))$ has a k-dimensional
Dirichlet distribution with parameter $(\alpha(A_1), \ldots, \alpha(A_k))$. The basic result for this
process is:


Theorem 1 (Ferguson, 1973) 


If P is a Dirichlet process with parameter $\alpha$, and if, given P, $X_1, \ldots, X_n$ is
a sample from P, then the posterior distribution of P given $X_1, \ldots, X_n$ is a
Dirichlet process with parameter $\alpha + \sum_{i=1}^n \delta(X_i)$, where $\delta(x)$ represents the
distribution giving mass one to the point x.

Two proofs of the existence of such a process were given, one
nonconstructive using the Kolmogorov consistency conditions, and the other
constructive, in which P is a sum of a countable number of point masses whatever be
$\alpha$. That a Dirichlet process has a representation that is discrete a.s. even if $\alpha$ is
continuous is a striking fact that has been the subject of several papers, e.g.,
Blackwell (1973), Berk and Savage (1979), Basu and Tiwari (1982). A new 
construction simpler than that of Ferguson has been given by Sethuraman and 
Tiwari (1982). 


Theorem 2 (Sethuraman and Tiwari, 1982) 


Let $Y_1, Y_2, \ldots$ be i.i.d. with a beta distribution, $\mathrm{Be}(M, 1)$, M > 0, let $Z_1$,
$Z_2, \ldots$ be i.i.d. $F_0$, and let $\{Y_i\}$ and $\{Z_i\}$ be independent. Define $P_1 = (1 - Y_1)$,
and $P_n = Y_1 \cdots Y_{n-1}(1 - Y_n)$ for n > 1. Then, $P = \sum_i P_i \delta(Z_i)$ is a Dirichlet
process with parameter $\alpha = M F_0$.
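
The construction in Theorem 2 translates almost directly into a simulation. The sketch below is a truncated approximation, not an exact draw: it stops once the unassigned mass Y_1 ... Y_n falls below a tolerance `tol`. The function name, the tolerance and the `base_sampler` argument (a function drawing one value from the base distribution) are our own choices for illustration.

```python
import random

def stick_breaking_draw(M, base_sampler, tol=1e-10, rng=None):
    """Approximate draw of a Dirichlet process with parameter M * F_0.

    Returns atoms Z_i and weights P_i, truncated once the remaining
    mass Y_1 * ... * Y_n is below tol.
    """
    rng = rng or random.Random()
    atoms, weights = [], []
    remaining = 1.0                          # Y_1 * ... * Y_{n-1}
    while remaining > tol:
        y = rng.betavariate(M, 1.0)          # Y_n ~ Be(M, 1)
        weights.append(remaining * (1.0 - y))  # P_n = Y_1...Y_{n-1}(1 - Y_n)
        atoms.append(base_sampler(rng))      # Z_n ~ F_0
        remaining *= y
    return atoms, weights

# Example: M = 2 with a standard normal base distribution.
atoms, weights = stick_breaking_draw(2.0, lambda r: r.gauss(0.0, 1.0),
                                     rng=random.Random(1))
```

The weights sum to one up to the truncation tolerance, and small M concentrates most of the mass on the first few atoms.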

Throughout, we shall use $M = \alpha(\mathcal{X})$ to represent the total mass of $\alpha$, and
$F_0 = \alpha/M$ to be the prior guess at P. The latter phrase stems from the fact that
from the definition, P(A) has a beta distribution, $\mathrm{Be}(\alpha(A), M - \alpha(A))$, so that
$E P(A) = \alpha(A)/M = F_0(A)$. In particular, the posterior guess at P given a
sample from P is, according to Theorem 1, $\hat{F}_n = p_n F_0 + (1 - p_n) F_n$, where $F_n$ is
the empirical process and $p_n = M/(M + n)$. As a consequence, suppose that it is
required to estimate with squared error loss the mean $\mu = \int x\,dP(x)$ of an
unknown distribution P on the real line based on a sample $X_1, \ldots, X_n$, with prior
$P \in \mathcal{D}(M F_0)$, where $F_0$ has finite first moment. Then, $\mu$ is finite a.s. and


$$E(\mu \mid X_1, \ldots, X_n) = p_n \mu_0 + (1 - p_n) \bar{X}_n$$


where $\mu_0$ is the mean of $F_0$, and $\bar{X}_n$ is the sample mean. (In subsequent
discussions, Bayes procedures are assumed to be taken with respect to squared 
error loss, unless stated otherwise.) 
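
A small numerical sketch of these posterior quantities follows. It assumes the weight p_n = M/(M + n) on the prior guess, as in the text; the function names and the example values are illustrative only.

```python
def posterior_weight(M, n):
    """Weight p_n = M / (M + n) given to the prior guess."""
    return M / (M + n)

def posterior_mean(sample, M, prior_mean):
    """Bayes estimate of the mean: p_n * mu_0 + (1 - p_n) * sample mean."""
    p_n = posterior_weight(M, len(sample))
    return p_n * prior_mean + (1 - p_n) * sum(sample) / len(sample)

def posterior_cdf(t, sample, M, prior_cdf):
    """Posterior guess at F(t): p_n * F_0(t) + (1 - p_n) * empirical cdf at t."""
    p_n = posterior_weight(M, len(sample))
    empirical = sum(1 for x in sample if x <= t) / len(sample)
    return p_n * prior_cdf(t) + (1 - p_n) * empirical
```

With M = 3, prior mean 0 and the sample (1, 2, 3), the weight is p_3 = 1/2 and `posterior_mean` returns 1.0, halfway between the prior mean and the sample mean 2. The total mass M thus acts like a prior sample size.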

In regard to this simple problem, there was an error in Ferguson (1974)
in stating that $\mu$ is finite a.s. if and only if $F_0$ has a finite first moment. That the
only if part is false was pointed out in Doss and Sellke (1982), who obtain the
following results on the tail behavior of P. Let $F(t) = P((-\infty, t])$.


Theorem 3 (Doss and Sellke, 1982)
If $F \in \mathcal{D}(M F_0)$, then


$$\exp(-h_1(t)) < 1 - F(t) < \exp(-h_2(t))$$

for sufficiently large t a.s.,


where $h_1(t) = 2 \log|\log(1 - F_0(t))| / (1 - F_0(t))$ and $h_2(t) = 1/\{(1 - F_0(t))
\times [\log(1 - F_0(t))]^2\}$.


As an example of this behavior, Yamato (1984) obtains the distribution
of $\mu$ when $F_0$ is a Cauchy distribution.


Theorem 4 (Yamato, 1984)


If $F \in \mathcal{D}(M F_0)$ where $F_0$ is a Cauchy distribution, then the random
variable $\mu = \int x\,dF(x)$ has the same Cauchy distribution.

In Cifarelli and Regazzini (1979) and in Hannum, Hollander and 
Langberg (1981), methods of finding the distribution of the mean of a Dirichlet 
process are reported. 

A number of simple applications were presented in Ferguson (1973) such 
as estimating a distribution function or a median, mean or variance. In the two- 
sample problem of estimating P(X > Y), the Mann-Whitney-Wilcoxon rank- 
sum statistic was seen to appear naturally. A number of other similar 
applications have appeared since that time. We mention a few. 

Yamato (1975) obtains a Bayes estimate for
$d(F, G) = \int (F(x) - G(x))^2\,d(F(x) + G(x))/2$, based on independent samples from F and G
which are given independent Dirichlet priors. Campbell and Hollander (1978)
provide estimates of the rank of $X_1$ among $X_1, \ldots, X_n$ based on $X_1, \ldots, X_s$, s < n,
when sampling from a Dirichlet process F. Hollander and Korwar (1980) find a
Bayes estimate of $\Delta(x) = G^{-1}(F(x)) - x$, a measure of the difference between F
and G at x, based on independent samples from each, with G known and F
having a Dirichlet prior. Dalal and Phadia (1983) consider the problem of
estimating $\tau = E\{\mathrm{sign}((X - X')(Y - Y'))\}$, a measure of dependence for a
bivariate distribution, where (X, Y) and (X', Y') are independent samples from
the distribution. The Bayes estimate is computed using a Dirichlet prior in 2
dimensions, and Kendall’s tau is seen to appear naturally. Zalkikar, Tiwari and
Jammalamadaka (1986) obtain a Bayes estimate for $\Delta(F) = P(Z > X + Y)$,
where X, Y, Z are i.i.d. chosen from F, based on a sample from F, which is given
a Dirichlet prior.

These are all examples of estimation problems. The difficulty of using 
Dirichlet priors in hypothesis testing problems was mentioned in Ferguson 
(1973), but Susarla and Phadia (1976) show how to test $H_0$: $F \le F_0$ for a given
distribution function $F_0$ using a Bayes approach. The idea is to replace the usual
zero/one loss function with the smoother loss $L(F, a_0) = \int (F - F_0)^+\,dW$ and
$L(F, a_1) = \int (F - F_0)^-\,dW$, where $a_0$ (resp. $a_1$) is the action accept (resp. reject)
$H_0$, and W is an arbitrary weighting measure. This idea also extends to multiple
decision problems.


Relation to Tailfree and Neutral Processes 


Let $\mathcal{P}_1, \mathcal{P}_2, \ldots$ be a sequence of finite measurable partitions of $\mathcal{X}$ such
that for all n > 1, $\mathcal{P}_{n+1}$ is a refinement of $\mathcal{P}_n$. We say that a random
probability measure P on $(\mathcal{X}, \mathcal{A})$ is tailfree w.r.t. the sequence $\{\mathcal{P}_n\}$ if the sets
of random variables $\{P(B \mid A): A \in \mathcal{P}_{n-1}, B \in \mathcal{P}_n\}$ for n = 1, 2,... are
independent. (Here $\mathcal{P}_0 = \{\mathcal{X}\}$.) The notion of tailfree processes goes back to
Freedman (1963), Fabius (1964) and Kraft (1964). In the dyadic tailfree process,
each set of the partition $\mathcal{P}_n$ is cut into two pieces in the partition $\mathcal{P}_{n+1}$.

One drawback of using a tailfree process as a prior is that the behavior of 
the estimates depends on the choice of the partitions used to describe the process. 
This is true with one notable exception. The Dirichlet process is tailfree with 
respect to every sequence of partitions. Moreover, if a process is tailfree with 
respect to every sequence of partitions then it is either a Dirichlet process or a 
limit of Dirichlet processes or concentrated on two nonrandom points (Fabius, 
1973). 

There is another class of prior distributions that shares this property to a 
lesser degree, the processes neutral to the right, introduced by Doksum (1974). A 
random distribution function F(t) on the real line is said to be neutral to the 
right if for every m and $t_1 < t_2 < \cdots < t_m$, the random variables $1 - F(t_1)$,
$(1 - F(t_2))/(1 - F(t_1)), \ldots, (1 - F(t_m))/(1 - F(t_{m-1}))$ are independent. This is
equivalent to saying $Y(t) = -\log(1 - F(t))$ has nonnegative independent
increments. The basic theorem is:


Theorem 5 (Doksum, 1974) 


If F is neutral to the right, and if $X_1, \ldots, X_n$ is a sample from F, then the
posterior distribution of F given $X_1, \ldots, X_n$ is neutral to the right.

Basically, a process neutral to the right is tailfree with respect to every
sequence of partitions $\{\mathcal{P}_n\}$ such that $\mathcal{P}_{n+1}$ is obtained from $\mathcal{P}_n$ by splitting the
rightmost element, $(t_n, \infty)$, into two pieces, $(t_n, t_{n+1}]$, $(t_{n+1}, \infty)$. Thus, a
Dirichlet process on the real line is neutral to the right, and neutral to the left, 
etc. Doksum (1974) conjectured that this property characterizes the Dirichlet 
process. This has been settled affirmatively. 


Theorem 6 (James and Mosimann, 1980) 


If F is neutral to the right and neutral to the left, then F is a Dirichlet 
process or a limit of Dirichlet processes or concentrated on two nonrandom 
points. 
For another characterization of the Dirichlet process in terms of
Johnson’s sufficiency postulate or learn-merge invariance, see Boge and Mocks
(1986).


Applications of Mixtures of Dirichlet Processes 


In the paper of Antoniak (1974), a number of Bayesian statistical 
problems with Dirichlet process priors were discussed whose solution involved 
posterior mixtures of Dirichlet processes, in particular empirical Bayes, bio-assay, 
regression, discrimination, and classification problems. The computational 
difficulties involved were such that Antoniak treated only very small size 
problems. Since then, Monte Carlo methods due to Kuo (1986) have been 
developed making Bayes solutions to these problems feasible. See Dalal (1978) 
and Dalal and Hall (1980) for a discussion of approximation of arbitrary random 
probability measures by mixtures of Dirichlets. 


1. Bayes empirical Bayes 


Consider first the Bayes empirical Bayes problem. In the usual empirical
Bayes setting, it is assumed that unobservable parameters $\theta_j$, j = 1,...,n are taken
independently from an unknown distribution G, and that associated with each $\theta_j$,
a random variable $X_j$ is chosen independently from a distribution with density
$f_j(x \mid \theta_j)$ for j = 1,...,n. The problem is to estimate one or more of the $\theta_j$. Most
procedures use $X_1, \ldots, X_n$ to obtain an estimate $G_n$ of G first and then estimate $\theta_j$
as the Bayes estimate with respect to the prior $G_n$. In the Bayes approach to the
empirical Bayes problem, a prior distribution is placed on G. Berry and
Christensen (1979) take G to be a Dirichlet process, $\mathcal{D}(\alpha)$. Following Antoniak,
the posterior distribution of G is a mixture of Dirichlet processes with parameter
$\alpha + \sum_{j=1}^n \delta(\theta_j)$ and mixing distribution $H(\theta \mid X)$, the posterior distribution of the $\theta_j$
given the $X_j$, in symbols,


$$G \mid X \in \int \mathcal{D}\Big(\alpha + \sum_{j=1}^n \delta(\theta_j)\Big)\,dH(\theta \mid X). \qquad (1)$$


In view of the computational difficulties, even in the simple case where $f_j(x \mid \theta)$ is a
binomial distribution with probability of success $\theta$ and sample size depending on
j, Berry and Christensen suggest a couple of rough approximations to the Bayes
rule that are easy to evaluate.

Monte Carlo approximation of the exact Bayes estimate was considered
by Kuo (1986a, 1986b). Let $H(\theta)$ denote the unconditional marginal distribution
of $\theta$,


$$dH(\theta) = \prod_{j=1}^n (M + j - 1)^{-1} \Big(\alpha + \sum_{i=1}^{j-1} \delta(\theta_i)\Big)(d\theta_j), \qquad (2)$$

as given in Blackwell and MacQueen (1973). Then, from a formula of Lo (1984)
for the posterior distribution of $\theta$ given X,


$$dH(\theta \mid X) = \prod_{j=1}^n f_j(X_j \mid \theta_j)\,dH(\theta) \Big/ \int \prod_{j=1}^n f_j(X_j \mid \theta_j)\,dH(\theta), \qquad (3)$$


the exact Bayes estimate of $\theta_n$, say, may be written,


$$\hat{\theta}_n(X) = \int \theta_n\,dH(\theta \mid X) = \frac{\int \theta_n \prod_{j=1}^n f_j(X_j \mid \theta_j)\,dH(\theta)}{\int \prod_{j=1}^n f_j(X_j \mid \theta_j)\,dH(\theta)}. \qquad (4)$$


The obvious Monte Carlo method in which vectors $\theta^1, \ldots, \theta^N$ are generated i.i.d.
from the distribution (2) and then used to approximate the two integrals in the
right side of (4) does not work well. In the method of Kuo, Monte Carlo is used
only to decide which of the $\theta_j$ are equal to which others according to (2). Then
the n-dimensional integrals in the right side of (4) reduce to a product of
1-dimensional integrals $dF_0(\theta)$, which can often be integrated exactly, for example,
if $F_0(\theta)$ is taken as a conjugate prior of $f(x \mid \theta)$.
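
Sampling from the marginal distribution (2) can be sketched as a Pólya urn: the j-th value is a fresh draw from F_0 with probability M/(M + j - 1) and otherwise repeats one of the earlier values chosen uniformly at random. The code below is an illustration of that urn scheme only, not of Kuo's full procedure; the function name and the `base_sampler` argument are our own.

```python
import random

def polya_urn_sample(n, M, base_sampler, rng=None):
    """Draw (theta_1, ..., theta_n) from the Blackwell-MacQueen urn scheme.

    base_sampler(rng) returns one draw from the base distribution F_0.
    """
    rng = rng or random.Random()
    thetas = []
    for j in range(n):                      # j values have been drawn so far
        if rng.random() < M / (M + j):
            thetas.append(base_sampler(rng))   # new value from F_0
        else:
            thetas.append(rng.choice(thetas))  # tie with an earlier value
    return thetas

# Example: the pattern of ties among ten values with M = 1.
sample = polya_urn_sample(10, 1.0, lambda r: r.gauss(0.0, 1.0),
                          rng=random.Random(0))
```

The pattern of ties in `sample` is the only part that Kuo's method simulates; given that pattern, each group of equal values contributes a single one-dimensional integral against F_0.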


2. Bayesian bio-assay 


As another application, consider the bio-assay problem. Let F(t) denote 
the probability of a positive response for a subject treated at dose level t. It is 
assumed that F(t) increases with t. Suppose that $n_j$ subjects are treated at dose
level $t_j$ and that $Y_j$ is the number of positive responses, j = 1,...,L. It is assumed
that the $Y_j$ are independent binomial variables with probability $F(t_j)$ of success.
The problem is to estimate F. The Bayes approach to this problem goes back to 
Kraft and Van Eeden (1964) who use a dyadic tailfree process as a prior. 
Ramsey (1972) uses a Dirichlet process prior and obtains the modal estimates of 
F by maximizing the finite dimensional joint density of the posterior distribution. 
(This seems to be the first description of the Dirichlet process; unfortunately, it is 
in a problem where the posteriors are not Dirichlet.) 

Bhattacharya (1981) develops a large sample procedure for 
approximating the finite-dimensional distributions of the posteriors as a normal 
mixture of Dirichlet distributions. Disch (1981) considers the problem of 
estimating quantiles of a potency curve with Dirichlet process priors, and avoids 
the difficult computational tasks by suggesting approximations similar to those 
made by Berry and Christensen in the empirical Bayes problem. However, the 
methods of Kuo may be applied to this problem as well. For related work, see
Kuo (1983, 1988) and Ammann (1984).


3. Bayesian density estimation


Another application of mixtures of Dirichlet processes is to estimate a
density, f(x), based on a sample of size n from f. Lo (1984) puts a prior on f, by
writing $f(x) = \int K(x, u)\,dG(u)$ and letting G have a Dirichlet process prior, $\mathcal{D}(\alpha)$.
He obtains the posterior distribution of G as a mixture of Dirichlet processes and
uses this to obtain formulas for the Bayes estimate of f. One of his applications
is to the two-parameter normal kernel $K = \phi(x \mid \mu, \sigma)$. This example was
expanded in Ferguson (1983) who, using the representation of Sethuraman and
Tiwari (1982), described f(x) as a mixture of normal densities, $\sum_i P_i \phi(x \mid \mu_i, \sigma_i)$,
where the $P_i$ are as in Theorem 2 and the $(\mu_i, \sigma_i)$ are a sample from the
four-parameter conjugate prior for the normal. Kuo’s method was seen to provide a
simple and effective means of performing the computations for large data sets. 
The estimates are seen to provide evidence for two suggestions: (1) for using a 
variable kernel estimate with wider windows at the tails, and (2) for using 
shrinkage estimates on the observations, namely bringing observations in toward 
the center, proportional to their distance from the center. In the paper of Kumar 
and Tiwari (1989), Kuo’s method is applied to estimating a mixture of 
exponential densities. 

Gaussian processes may also be used to generate densities. In the 
approach of Leonard (1978), a density on the interval (a, b) is written as
$\exp\{g(t)\} / \int \exp\{g(x)\}\,dx$ where g is a given Gaussian process. An alternate
approach is provided by Thorburn (1986), in which the density is written as
$\exp\{g(t)\}$ where g(t) is a Gaussian process conditional on $\int \exp\{g(x)\}\,dx = 1$.


Application to Censored Data and Reliability 


An important extension of nonparametric Bayes theory is to the 
treatment of censored data. The problem of estimating an unknown cdf F based 
on censored data is usually formulated as follows. Let $X_1, \ldots, X_n$ be a sample
from F, and let the censoring points, $Y_1, \ldots, Y_n$, be random variables independent
of the X’s. The observations are $Z_j = \min(X_j, Y_j)$, and $\delta_j = I(X_j \le Y_j)$,
j = 1,...,n, where I(A) represents the indicator function of the set A. The
problem is to estimate F based on the observations. The usual nonparametric 
estimate is the product limit estimate, due to Kaplan and Meier (1958). 

The first completely Bayes approach to this problem was made by 
Susarla and Van Ryzin (1976) who use a Dirichlet process as a prior for F. Let
$u_1 < u_2 < \cdots < u_k$ be the distinct observations among $Z_1, \ldots, Z_n$; let $\lambda_j$ denote
the number of censored observations at $u_j$; let k(t) denote the number of $u_j \le t$;
and let $h_j$ be the number of $Z_i > u_j$.


Theorem 7 (Susarla and Van Ryzin, 1976) 


If $F \in \mathcal{D}(\alpha)$, then the posterior expectation of the survival function,
1 - F(t), given the observations is


$$E(1 - F(t) \mid \text{data}) = \frac{\alpha(t) + h_{k(t)}}{M + n} \prod_{j=1}^{k(t)} \frac{\alpha(u_j) + h_j + \lambda_j}{\alpha(u_j) + h_j} \qquad (5)$$


where $\alpha(t) = \alpha((t, \infty))$, $M = \alpha(\mathbb{R})$, and $h_0 = n$.

This estimate reduces to the Kaplan-Meier estimate as the prior 
information, M, goes to zero. If there are no censored observations, the product 
term vanishes and we get the Bayes estimator of Ferguson (1973). Blum and 
Susarla (1977) complemented this result by showing that the posterior 
distribution of F given the data is a mixture of Dirichlet processes with specified 
transition and mixing measures. 
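
The estimate of Susarla and Van Ryzin is simple to compute once the tail measure of the prior is available. The sketch below assumes the estimate has the product form (alpha(t) + number of Z_i > t) / (M + n), multiplied over the distinct observations u_j <= t by (alpha(u_j) + h_j + lambda_j) / (alpha(u_j) + h_j), where alpha(s) is the prior mass of (s, infinity), h_j counts the Z_i > u_j and lambda_j counts the censored observations at u_j. The function and argument names are illustrative, and alpha is assumed to have a strictly positive tail.

```python
import math

def susarla_van_ryzin(t, z, delta, alpha_tail, M):
    """Posterior expected survival probability at t under a Dirichlet process prior.

    z          -- observed values Z_j = min(X_j, Y_j)
    delta      -- 1 if Z_j is an uncensored observation, 0 if censored
    alpha_tail -- function s -> alpha((s, infinity)), assumed positive
    M          -- total mass of alpha
    """
    n = len(z)
    n_greater = sum(1 for zi in z if zi > t)
    estimate = (alpha_tail(t) + n_greater) / (M + n)
    for u in sorted(set(z)):
        if u > t:
            break
        lam = sum(1 for zi, di in zip(z, delta) if zi == u and di == 0)
        h = sum(1 for zi in z if zi > u)
        if lam:  # uncensored points contribute a factor of one
            estimate *= (alpha_tail(u) + h + lam) / (alpha_tail(u) + h)
    return estimate

# Example: unit-mass exponential prior guess, one observation censored at 1.
survival_at_2 = susarla_van_ryzin(2.0, [1.0], [0], lambda s: math.exp(-s), 1.0)
```

With no censored observations every factor in the product equals one and the function returns (alpha(t) + number of Z_i > t) / (M + n), the estimate of Ferguson (1973).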

This research was generalized to prior distributions neutral to the right 
by Ferguson and Phadia (1979). With Dirichlet process priors, the updating 
mechanism of going from prior to posterior is easy for uncensored observations 
and difficult for censored observations. For prior processes neutral to the right, it 
is the other way around. Thus, the generality provided by priors neutral to the
right makes them the natural priors to use for censoring problems. Also, it should
be noted that the estimate in Theorem 7 does not depend on the distributions of
the $Y_j$. Indeed, this should be the case when a Bayesian analysis is performed; in
fact, as Ferguson and Phadia point out, the $Y_j$ may be considered as constants,
allowing treatment of problems in which future $Y_j$ may depend upon past
observations. 

However, if $X_j$ and $Y_j$ are allowed to be dependent, the marginal
distribution of X may not be identifiable. Nevertheless, a Bayesian treatment of
the problem is possible and has been carried out by Phadia and Susarla (1983),
by assuming a Dirichlet process prior for the joint distribution of (X, Y). They
derive the Bayes estimate of the joint distribution, which of course need not be
consistent. See also Arnold et al. (1984). Tsai (1986) adopts a different
approach by taking the joint distribution of $(Z, \delta)$ to be a Dirichlet process on
$\mathbb{R} \times \{0,1\}$, and making an independence-like assumption that makes the marginal
distribution of X identifiable, and the Bayes estimate of F consistent. Since the
marginal distribution of F is not Dirichlet under this assumption, his resulting
Bayes estimate is quite distinct from that of Susarla and Van Ryzin in the
independent case.

For a review of the area up to 1980, see Phadia (1980b). For consistency 
of (5) and the product limit estimate, see Susarla and Van Ryzin (1978b) and 
Phadia and Van Ryzin (1980). For related results, see Gardiner and Susarla 
(1982, 1983), Colombo, Costantini and Jaarsma (1985), Rao and Tiwari (1985), 
Johnson and Christensen (1986) and Berliner and Hill (1988). 


1. Application to reliability theory 


A useful generalization of the gamma process for statistical problems has 
been introduced independently by Dykstra and Laud (1981) and Lo (1982). Given 
a nondecreasing left-continuous function $\alpha$ on $[0, \infty)$ with $\alpha(0) = 0$, V(t) is said
to be a gamma process with parameter $\alpha$ if V(t) is a process with independent
increments such that for all t > 0 the distribution of V(t) is $G(\alpha(t), 1)$, the
gamma distribution with shape parameter $\alpha(t)$ and scale parameter 1. Given a
nonnegative function $\beta$ on $[0, \infty)$, the weighted gamma process with parameters
$\alpha$ and $\beta$, $\mathcal{G}(\alpha, \beta)$, is then defined as the process $r(t) = \int_{[0,t)} \beta(s)\,dV(s)$. Its
elementary properties include


Theorem 8 (Dykstra, Laud and Lo) 


If $r \in \mathcal{G}(\alpha, \beta)$, then r is a process with independent increments, $E(r(t))
= \int_{[0,t)} \beta(s)\,d\alpha(s)$, and $\mathrm{Var}(r(t)) = \int_{[0,t)} \beta^2(s)\,d\alpha(s)$.


Dykstra and Laud use this process (which they call an extended gamma 
process) as a prior distribution on the hazard rate function in nonparametric 
reliability problems; that is, they assume that the survival function, S(t) =
1 - F(t), has the form $S(t) = \exp\{-\int_{[0,t)} r(s)\,ds\}$, where $r \in \mathcal{G}(\alpha, \beta)$.


Theorem 9 (Dykstra and Laud) 
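
If $r \in \mathcal{G}(\alpha, \beta)$, then $E S(t) = \exp\{-\int_{[0,t)} \log(1 + \beta(s)(t - s))\,d\alpha(s)\}$. If
$X_1, \ldots, X_n$ is a sample from S, then the posterior distribution of r given the
censored data $X_1 \ge x_1, \ldots, X_n \ge x_n$ is $\mathcal{G}(\alpha, \beta^*)$, where


$$\beta^*(t) = \beta(t) \Big/ \Big(1 + \beta(t) \sum_{i=1}^n (x_i - t)^+\Big).$$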


They also show that the distribution of r given an uncensored sample is a 
mixture of weighted gamma processes, and examples are given showing the 
computational problems involved can be solved. This approach gives probability 
one to the absolutely continuous distributions, and Bayes estimates of the hazard 
rate and the cdf are derived. 
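
The prior expected survival function in Theorem 9 is easy to evaluate numerically. The sketch below assumes that alpha is absolutely continuous with density `alpha_density` (an assumption made here for illustration, not required by the theorem) and uses a plain midpoint rule; the function name and step count are our own.

```python
import math

def expected_survival(t, beta, alpha_density, steps=20000):
    """Midpoint-rule value of exp(-int_0^t log(1 + beta(s)(t - s)) alpha'(s) ds)."""
    h = t / steps
    total = 0.0
    for i in range(steps):
        s = (i + 0.5) * h
        total += math.log1p(beta(s) * (t - s)) * alpha_density(s) * h
    return math.exp(-total)

# Example: alpha(s) = s and constant beta = 2, evaluated at t = 1.5.
numeric = expected_survival(1.5, lambda s: 2.0, lambda s: 1.0)

# For alpha(s) = a*s and beta = b the integral has the closed form
# a * ((1 + b t) log(1 + b t) - b t) / b.
closed_form = math.exp(-((1 + 3.0) * math.log(1 + 3.0) - 3.0) / 2.0)
```

The two values agree to many decimal places. The posterior expected survival given censored data follows by the same routine with beta replaced by the updated function of Theorem 9.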

Since in the above construction the gamma process has nondecreasing 
sample paths, the resulting survival distribution has increasing failure rate (IFR). 
Ammann (1984, 1985) puts this approach in a more general setting by repre- 
senting the hazard rate as a function of the sample paths of nonnegative processes 
with independent increments which consist of an increasing component as well as 
a decreasing component. This results in a broad class of priors over a space of 
absolutely continuous distributions which contain IFR, DFR and U-shaped 
failure rate survival distributions. Ammann finds the posterior Laplace 
transforms of these processes based on data that may contain censored 
observations, and applies his approach to the competing risk model as well. 

The Bayesian analysis discussed above may be extended to incorporate a 
covariate using the Cox proportional hazard model as was done by Kalbfleisch 
(1978). Independent observations $X_1, \ldots, X_n$ are made with respective covariate
vectors $W_1, \ldots, W_n$ according to the survival distribution,


$$S(x \mid w) = S_0(x)^{\exp(\beta' w)}$$

where $\beta$ is the vector of regression parameters, and $S_0(x)$ is the baseline survival
distribution. While the main interest in covariate analysis centers around the
estimation and hypothesis testing of $\beta$, considering $S_0(x)$ as a nuisance
parameter, it is still of interest to estimate $S_0(x)$ by itself. Writing
$S_0(x) = \exp\{-\Lambda(x)\}$, Kalbfleisch takes $\Lambda(x)$ to have a gamma process prior, and carries
out the estimation of $\beta$ by determining the marginal distribution of the
observations as a function of $\beta$ with $S_0(x)$ integrated out. Thus, the treatment is
semi-parametric and semi-Bayesian. This approach was generalized to allow
$1 - S_0(x)$ to be an arbitrary process neutral to the right by Wild and Kalbfleisch
(1981). For related results, see Padgett and Wei (1981) and Mazzuchi and 
Singpurwalla (1985). 


Empirical Bayes Estimation 


Bayesian methods have been found to be useful in the non-Bayesian 
treatment of empirical Bayes problems. Suppose we are at the (n + 1)st stage of
an experiment, and information is available not only from the current stage but
also from the n previous stages. Let $F_1, F_2, \ldots, F_{n+1}$ be n+1 distributions on the
real line, and for j = 1,...,n+1, let $X_j = (X_{j1}, \ldots, X_{jm_j})$ be a sample of size $m_j$ from
$F_j$. As a prior, we assume that $F_1, \ldots, F_{n+1}$ are a sample from the Dirichlet $\mathcal{D}(\alpha)$
where $\alpha = M G_0$. We wish to estimate $F_{n+1}(t)$ with squared error loss,


$$L(F_{n+1}, \hat{F}) = \int (F_{n+1}(t) - \hat{F}(t))^2\,dW(t) \qquad (6)$$


for some finite measure W. If we know M and $G_0$, this becomes a
straightforward Bayes problem whose solution is


$$\tilde{F}_{n+1}(t) = p_{n+1} G_0(t) + (1 - p_{n+1}) \hat{F}_{n+1}(t) \qquad (7)$$


where $p_j = M/(M + m_j)$ and $\hat{F}_j(t)$ is the sample distribution function based on
$X_j$. If $\alpha$ is unknown, we cannot use this estimate, but we may use $X_1, \ldots, X_n$ to
help estimate M and $G_0$.

Korwar and Hollander (1976) and Hollander and Korwar (1977) consider 
the case where M is known and $G_0$ is unknown. They estimate $G_0(t)$ by the
average of the sample distribution functions of $X_1, \ldots, X_n$ and propose the
following empirical Bayes estimator of $F_{n+1}$:


$$H_{n+1}(t) = p_{n+1} \frac{1}{n} \sum_{j=1}^n \hat{F}_j(t) + (1 - p_{n+1}) \hat{F}_{n+1}(t). \qquad (8)$$


We say that this sequence of estimates is asymptotically optimal relative 
to a class of Dirichlet process priors if the Bayes risk of $H_{n+1}$ given $\alpha$, call it
$r(\alpha, H_{n+1})$, converges to the Bayes risk of the Bayes estimate (7), call it $r(\alpha)$,
whatever be $\alpha$ in the class. Since asymptotic optimality is a weak property, one
wants rates of convergence. Korwar and Hollander prove:




Theorem 10 (Hollander and Korwar, 1977) 


$$r(\alpha, H_{n+1}) = r(\alpha)\Big[1 + p_{n+1} \sum_{j=1}^n (1 - p_j)^{-1} \Big/ n^2\Big]$$


When all the $m_j$ are equal, say to m, this reduces to $r(\alpha)(1 + M/(mn))$.
Thus, $\{H_{n+1}\}$ is asymptotically optimal relative to the class of Dirichlet priors
with fixed M, and the rate of convergence is O(1/n). Hollander and Korwar also
treat the empirical Bayes estimation of a mean, with similar results. 

In their paper on testing hypotheses, Susarla and Phadia (1976) also 
consider the empirical Bayes extension of their problem using the method of 
Hollander and Korwar. In addition, they allow M as well as Go to be unknown, 
and, using an estimate of M based on the estimate of Korwar and Hollander 
(1973), exhibit an empirical Bayes estimate that is asymptotically optimal 
relative to the class of all Dirichlet priors. The extension of the Hollander and 
Korwar result to unknown M was made in the equal sample size case by 
Zehnwirth (1981), using a new estimate of M. The estimate is as follows. Let $F_n$
denote the F-statistic in the one-way analysis of variance based on $X_1, \ldots, X_n$ ($F_n$
= ratio of the mean sum of squares between populations to the mean sum of
squares within populations).


Theorem 11 (Zehnwirth, 1981) 


$m/(F_n - 1) \to M$ in probability as $n \to \infty$.


The extension to empirical Bayes estimation of a distribution function 
based on censored data was made by Susarla and Van Ryzin (1978a) when all
sample sizes, $m_j$, are 1, obtaining asymptotically optimal estimates at rate
O(1/n). Since the proposed estimate was not necessarily nondecreasing, Phadia
(1980a) suggested using a simpler somewhat better estimate of $G_0$, which has the
desirable property that the resulting empirical Bayes estimate is nondecreasing.
This problem has also been treated by Ghorai (1981), taking a gamma process
for $-\log(1 - F(t))$ and obtaining asymptotically optimal estimates at rate O(1/n).

In the uncensored case, Ghosh, Lahiri and Tiwari (1989) propose an 
empirical Bayes estimator of $F_{n+1}$ that uses both the past as well as the current 
data for estimating $G_0$. Their proposed estimator is given by (7) with $G_0$ 
replaced by 


$$\hat G_0(t) = \sum_{j=1}^{n+1} (1 - p_j)\hat F_j(t) \Big/ \sum_{j=1}^{n+1} (1 - p_j). \tag{9}$$


Letting $\tilde H_{n+1}$ denote the resulting estimator, they derive the following result. 


Theorem 12 (Ghosh et al., 1989) 


$$r(\tilde H_{n+1}, \alpha) = r(\alpha)\Big[1 + p_{n+1}\Big(\sum_{j=1}^{n+1} (1 - p_j)\Big)^{-1}\Big] \tag{10}$$


That this is a uniform improvement on the estimator in Theorem 10 is 
easily seen using Schwarz's inequality. Moreover, Ghosh et al. have established 
the optimality of the weights used in (9), namely that the Bayes risk of $\tilde H_{n+1}$ is 
smaller than the Bayes risk of any other estimator that is a linear combination of 
the $\hat F_j$. In addition, they make a similar improvement to Zehnwirth's estimator 
of $M$ by allowing it to depend upon $X_{n+1}$ as well as by allowing the sample sizes 
to differ. 

We comment briefly on other papers in the area. Hollander and Korwar 
(1976) treat the empirical Bayes estimation of $P(X > Y)$ in a two-sample 
problem. Phadia and Susarla (1979) treat the same problem allowing right 
censored data. Ghorai and Susarla (1982) consider the empirical Bayes 
estimation of a density using Lo's estimate. Ghosh (1985) and Tiwari and 
Zalkikar (1985a, b) consider empirical Bayes estimation problems for general 
estimable parameters of degree one and two. Tiwari, Jammalamadaka and 
Zalkikar (1988) treat the empirical Bayes version of the paper of Gardiner and 
Susarla (1983). 


Random Symmetric Distributions; Problems of Consistency 


An extension of the family of Dirichlet processes to the family of 
Dirichlet invariant processes was introduced by Dalal (1979a). Let 
$\mathcal{G} = \{g_1, \ldots, g_k\}$ be a fixed finite group of measurable transformations from 
$\mathcal{X}$ into itself. Let $\alpha$ be a $\mathcal{G}$-invariant finite non-null measure on $\mathcal{X}$. A random 
probability measure $P$ on $(\mathcal{X}, \mathcal{A})$ is said to be a Dirichlet invariant process with 
parameter $\alpha$, in symbols $P \in DG(\alpha)$, if $P$ is $\mathcal{G}$-invariant (surely) and if for every 
partition $(A_1, \ldots, A_m)$ of $\mathcal{X}$ made up of measurable invariant sets, 
$(P(A_1), \ldots, P(A_m)) \in D(\alpha(A_1), \ldots, \alpha(A_m))$. Dalal and others (Tiwari, 1988; 
Hannum and Hollander, 1983) give constructive definitions along the following 
lines. Let $P \in D(\alpha)$ and define $P^*$ as $P^*(A) = (1/k)\sum_{i=1}^{k} P(g_i A)$. Then the 
distribution of $P^*$ depends only upon $\alpha^*$, where $\alpha^*(A) = (1/k)\sum_{i=1}^{k} \alpha(g_i A)$, and 
$P^* \in DG(\alpha^*)$. 


When $\mathcal{G}$ consists of only the identity transformation, $DG(\alpha)$ corresponds 
to the usual Dirichlet process, $D(\alpha)$. When $\mathcal{G}$ is generated by $g(x) = -x$, $DG(\alpha)$ 
gives probability one to distributions that are symmetric about zero. Dalal 
(1979a) derives several properties of the Dirichlet invariant process and applies 
the theory to the estimation of a distribution function known to be symmetric 
about a known point, $\theta$. The analysis is extended in Dalal (1979b) to the case 
where $\theta$ is unknown but given a prior distribution $\nu$ independent of $P$. See Dalal 
(1980) for an expository article on these problems. 

An important analysis of these results, both theoretically and practically, 
has been given by Diaconis and Freedman (1986a, b). Such estimates may not 
be consistent throughout the support of the prior, as detailed in Theorem 13 
below. The first example of an inconsistent Bayes estimate was given by 
Freedman (1963). A simple example of this phenomenon, Ferguson (1973), may 
be described as follows. 

Let the prior distribution of $F$ be the mixture, $F \in p_0 H + (1 - p_0) D(\alpha)$, 
where $p_0$, the prior probability of $H$, is 1/2, where $H$ is the uniform distribution 
on the interval (0, 1), and where $\alpha = MH$ with $M = 1$. The support of $F$ is the 
set of all distributions on [0, 1]. The distribution of the distinct observations 
among a sample $X_1, \ldots, X_n$ from $F$ is the same when $F \in D(\alpha)$ as when $F = H$. 
Thus, as long as the observations are distinct, the posterior distribution of $F$ 
given $X_1, \ldots, X_n$ is $p_n H + (1 - p_n) D(\alpha + \sum \delta(X_i))$, where $p_n$, the posterior 
probability of $H$, is easily computed to be $p_n = n!/(n! + 1)$. If ever two 
observations are exactly equal, then the possibility of $H$ disappears and $F$ has the 
posterior distribution $D(\alpha + \sum \delta(X_i))$. Now, suppose that the true distribution is 
continuous on (0, 1). No matter how non-uniform this distribution may be, the 
Bayes estimate of $F$ converges to $U(0, 1)$. 
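
The posterior weight on $H$ in this example can be checked with a few lines of exact arithmetic. The sketch below is illustrative only: the function name and its arguments are assumptions, and it relies on the standard fact that, under a Dirichlet process with parameter $MH$, a sample of $n$ distinct values carries the extra factor $\prod_{i=1}^{n} M/(M + i - 1)$ relative to sampling from $H$ itself.

```python
from fractions import Fraction


def posterior_prob_uniform(n, p0=Fraction(1, 2), M=1):
    """Posterior probability of H after n distinct observations.

    Hypothetical helper for the mixture example: H is uniform on (0, 1),
    alpha = M * H, and M is taken to be a positive integer so that the
    arithmetic stays exact.
    """
    # Relative likelihood of n distinct values under D(alpha) versus H.
    weight_dirichlet = Fraction(1)
    for i in range(1, n + 1):
        weight_dirichlet *= Fraction(M, M + i - 1)
    return p0 / (p0 + (1 - p0) * weight_dirichlet)


if __name__ == "__main__":
    for n in (1, 3, 5, 10):
        print(n, posterior_prob_uniform(n))
```

With $M = 1$ and $p_0 = 1/2$ the weight is $n!/(n! + 1)$, so the posterior mass on the uniform distribution tends to one very quickly whenever the observations remain distinct.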

Freedman and Diaconis (1983) have a positive result along the lines of 
this example: If $F$ is a mixture of $D(\alpha_j)$ with $\alpha_j = M_j F_j$, and if the $M_j$ are 
bounded, then the Bayes estimate of $F$ is consistent. In the example above, one 
can think of $H$ as a Dirichlet process with $M = \infty$, so although $F$ is a mixture of 
Dirichlets, the $M_j$ are not bounded. In Dalal's model, even if the true 
distribution is symmetric, the Bayes estimate may oscillate indefinitely between two 
wrong values. 


Theorem 13 (Diaconis and Freedman, 1986a, b) 


Let $\theta$ and $F$ be independent, with $\theta$ having a standard normal 
distribution, and $F \in DG(\alpha)$ symmetric about zero, where $\alpha = MF_0$ with $F_0$ the 
standard Cauchy distribution. Then there exists a symmetric density, $h(x)$, with 
a maximum at zero and bounded support, such that if the true distribution of the 
$X_i$ has density $h$, then the Bayes estimate of $\theta$ does not converge. 

Doss (1984) provides a deep extension of the analysis of these problems 
from symmetric Dirichlet priors to symmetric priors neutral to the right. Doss 
(1985a, b) considers the problem of estimating a median in a different 
nonparametric Bayes framework. Let $F(x)$ be a distribution function with median zero, 
let $\theta$ be a real number, and let $X_1, \ldots, X_n$ be a sample from $F(x - \theta)$. To place a 
prior distribution on $F$ that chooses median zero distributions with probability 
one, let $\alpha$ be a finite non-null measure, written as $\alpha = MF_0$, where $F_0$ is a 
distribution function with median zero, and suppose for simplicity that $F_0$ has no 
mass at zero. Let $\alpha_-$ and $\alpha_+$ denote the restrictions of $\alpha$ to $(-\infty, 0)$ and $(0, \infty)$ 
respectively. Choose $F_-$ and $F_+$ independently from $D(\alpha_-)$ and $D(\alpha_+)$ 
respectively, and let $F(t) = (F_-(t) + F_+(t))/2$. Thus, $F$ is a random distribution 
function such that $F(0) = 1/2$; denote the distribution of $F$ by $D^*(\alpha)$. 


Theorem 14 (Doss, 1985a) 


Let $\theta$ and $F$ be independent, with $\theta \in \nu$ and $F \in D^*(\alpha)$, and assume 
that $F_0$ has continuous density $f_0$. Given $\theta$ and $F$, let $X = (X_1, \ldots, X_n)$ be a 
sample from $F(x - \theta)$. Then the posterior distribution of $\theta$ given $X$ is 


$$d\nu(\theta \mid X) = c(X)\Big[\prod\nolimits^{*} f_0(X_i - \theta)\Big] M(X, \theta)\, d\nu(\theta),$$


where $M(X, \theta)^{-1} = \Gamma(M/2 + nF_n(\theta))\,\Gamma(M/2 + n(1 - F_n(\theta)))$, $F_n$ is the empirical 
distribution function of $X$, $\prod^{*}$ represents the product over the distinct $X_i$, and 
$c(X)$ is a normalizing constant. 

Doss shows that if the true distribution of the $X_i$ is discrete, the Bayes 
estimate of $\theta$ is consistent. However, if it is continuous, then the Bayes estimate 
can converge to a wrong value, it can oscillate indefinitely between two wrong 
values, or the set of its limit points can be dense in $\mathbb{R}$. 

Hannum and Hollander (1983) have derived the Bayes risk of Dalal's 
(1979a) estimate of the distribution function under $DG(\alpha)$, and have compared it 
to the risk of Ferguson's (1973) estimator under $D(\alpha)$. This enables them to (i) 
assess the savings in risk obtained by incorporating known symmetry structure in 
the model, and (ii) provide information about the robustness of Ferguson's 
estimator against a prior for which it is not Bayes. Yamato (1986, 1987) and Tiwari 
(1988) used the Dirichlet invariant process prior to derive the Bayes estimator of 
estimable parameters of an arbitrary degree. 


Other Applications 


Our survey is by no means complete. We mention a few other selected 
results and applications in this final section. Binder (1982) considers finite 
population models in which a population $\{Y_1, \ldots, Y_N\}$ consists of a sample from 
$F \in D(\alpha)$. A sample, $y_1, \ldots, y_n$, is then taken from $\{Y_1, \ldots, Y_N\}$, and the Bayes 
estimate of $\sum Y_i$ is derived. The asymptotic distributions are found in Lo (1986). 
Problems of finding confidence bounds for a distribution function have been 
considered by Breth (1978), who finds recursive methods for computing 
$P(u_j < F(t_j) < v_j \text{ for } j = 1, \ldots, m)$ for fixed numbers $\{u_j\}$, $\{v_j\}$ and $\{t_j\}$ when $F$ 
is a Dirichlet process. In a companion paper, Breth (1979) applies the method 
to finding confidence intervals for quantiles and the mean, and also treats 
Bayesian tolerance intervals. Tamura (1988) applies Dirichlet process methods to 
auditing problems. 


1. Linear Bayes estimation 


The useful idea of restricting attention to a linear space of estimates in 
Bayesian nonparametric problems is due to Goldstein (1975a, b). Such estimates 
may require less knowledge of the prior and be much easier to compute than 
Bayes estimates without much loss of efficiency. As an example, consider the 
problem of estimating a mean $\mu = \int x\,dP(x)$ within the class of linear functions, 
$\hat\mu = a + \sum b_i X_i$. The Bayes solution is 


$$\hat\mu = \frac{M}{M+n}\,\mu_0 + \frac{n}{M+n}\,\bar X_n, \quad \text{where } \mu_0 = E(\mu) \text{ and } M = \frac{E(\sigma^2)}{E(\mu^2) - \mu_0^2}.$$


Here, $\sigma^2 = \int x^2\,dP(x) - \mu^2$ is the variance of the random distribution. This 
estimate is formally identical to the Bayes estimate with the Dirichlet prior, 
Theorem 1, with however a new interpretation for the parameter $M$. In addition, 
the only information needed to be elicited from the prior are the three quantities, 
$E(\mu)$, $E(\mu^2)$ and $E(\sigma^2)$. These ideas were further developed by Zehnwirth (1985) 
in treating estimation with censored data, by Poli (1985), who finds the best 
linear predictor in a multivariate regression model and specializes to a Dirichlet 
prior and to a normal/Wishart mixture of Dirichlets, and by Kuo (1988) in 
estimating the potency curve in Bayesian bio-assay. 
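
As a minimal sketch of how little prior input the linear Bayes estimate of a mean needs, the function below takes only the three elicited quantities $E(\mu)$, $E(\mu^2)$ and $E(\sigma^2)$. It assumes the shrinkage form $\hat\mu = (M\mu_0 + n\bar X_n)/(M + n)$ with $M = E(\sigma^2)/(E(\mu^2) - \mu_0^2)$; the function and argument names are hypothetical.

```python
def linear_bayes_mean(xs, prior_mean, prior_mean_sq, expected_var):
    """Linear Bayes estimate of a mean from three prior moments.

    prior_mean    : E(mu)
    prior_mean_sq : E(mu^2)
    expected_var  : E(sigma^2)
    All names are illustrative; none come from the source.
    """
    n = len(xs)
    var_of_mean = prior_mean_sq - prior_mean ** 2   # Var(mu) under the prior
    if var_of_mean <= 0:
        raise ValueError("E(mu^2) must exceed E(mu)^2")
    M = expected_var / var_of_mean                  # prior 'sample size'
    xbar = sum(xs) / n
    return (M * prior_mean + n * xbar) / (M + n)


if __name__ == "__main__":
    # M = 3 / (1 - 0) = 3, sample mean 2, so the estimate is 6 / 6 = 1.
    print(linear_bayes_mean([1.0, 2.0, 3.0], 0.0, 1.0, 3.0))
```

The ratio $M$ acts as a prior sample size: a large expected within-distribution variance relative to the prior variance of $\mu$ pulls the estimate towards $\mu_0$.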


2. Sequential problems 


A number of papers treat sequential nonparametric problems from a 
Bayesian viewpoint. Hall (1976, 1977) in treating sequential search problems 
with random overlook probabilities allows the distributions of the overlook 
probabilities to be Dirichlet or a mixture of Dirichlet. Ferguson (1982) discusses k-stage 
lookahead rules and modified rules in some nonparametric sequential estimation 
problems with Dirichlet priors. Clayton and Berry (1985) treat the finite horizon 
one-armed bandit with the unknown arm producing observations from a Dirichlet 
process. In a sequential testing problem, Clayton (1985) assumes that in sampling 
from $F \in D(\alpha)$, the payoff if you stop at $n$ is $\max(E(X \mid X_1, \ldots, X_n), v) - nc$, where 
$v$ and $c > 0$ are constants. He shows that the optimal stopping rule is bounded 
if the support of $\alpha$ is bounded, and he conjectures that this is true even if the 
support of $\alpha$ is unbounded. Christensen (1986) obtains a similar result for the 
problem of sampling without recall from a distribution $F \in D(\alpha)$ and constant 
cost of observation. Betro and Schoen (1987) consider the problem of sampling 
with recall and constant cost from a distribution $F$ assumed to be a simple 
homogeneous process neutral to the right. 


3. Point processes 


Lo (1982) considers the problem of estimation of the intensity measure $\gamma$ 
of a nonhomogeneous Poisson point process based on a random sample from this 
process. He shows that if the prior distribution for $\gamma$ is a weighted gamma 
distribution $G(\alpha, \beta)$, then given a sample $N_1, \ldots, N_n$ of $n$ functions from this 
process, the posterior distribution of $\gamma$ is again gamma, $G(\alpha + \sum N_i, \beta/(n\beta + 1))$. 
Lo also shows that the posterior process converges weakly to the Brownian 
bridge. 

Another paper of Lo (1981) describes an application to shock models and 
wear processes. A device is subject to shocks occurring randomly at times 
according to a homogeneous Poisson point process $N(t)$ with intensity $\gamma$. The $j$th 
shock causes a random amount $X_j$ of damage, assumed to be i.i.d. $F$ on $[0, \infty)$. 
For the prior distribution, $\gamma$ and $F$ are chosen to be independent, with $\gamma \in$ a 
gamma distribution $G(\lambda, \theta)$, and $F \in D(\alpha)$. In the posterior distribution based 
on a single observation of $N$ up to time $T$, $\gamma$ and $F$ are still independent, with 
$\gamma \in G(\lambda + N(T), \theta + T)$ and $F \in D(\alpha + \sum_{j=1}^{N(T)} \delta(X_j))$. This readily yields Bayes 
estimates of $\gamma$ and $F$. 
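
The intensity part of this update is the familiar conjugate gamma-Poisson step, sketched below. The sketch assumes the gamma prior is parametrized by a shape and a rate (so that its mean is shape/rate), and that `n_shocks` is the number of shocks observed up to time `T`; the function names are illustrative, not taken from Lo's paper.

```python
def update_intensity_prior(shape, rate, n_shocks, T):
    """Posterior (shape, rate) of the Poisson intensity.

    Assumes a gamma(shape, rate) prior and n_shocks events observed
    on the time interval [0, T].
    """
    return shape + n_shocks, rate + T


def posterior_mean_intensity(shape, rate, n_shocks, T):
    """Bayes estimate of the intensity under squared error loss."""
    post_shape, post_rate = update_intensity_prior(shape, rate, n_shocks, T)
    return post_shape / post_rate


if __name__ == "__main__":
    # Prior gamma(2, 1); 5 shocks seen in 4 time units.
    print(update_intensity_prior(2.0, 1.0, 5, 4.0))
    print(posterior_mean_intensity(2.0, 1.0, 5, 4.0))
```

The damage distribution is updated separately, by adding a point mass at each observed damage to the Dirichlet parameter, which is why the two components stay independent a posteriori.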

Johnson, Susarla and Van Ryzin (1979) present an application to the 
Bellman-Harris age-dependent branching process. Each individual $x$ born has a 
random length of life $\lambda_x$ and reproduces at death a random number $\xi_x$ of 
offspring, where the $(\lambda_x, \xi_x)$ are i.i.d. from $G \times P$. The prior distributions of $G$ 
and $P$ are taken to be independent Dirichlet processes with parameters $\alpha_1$ and 
$\alpha_2$, and Bayes estimates of $G$ and $P$ are developed based on an observation of the 
process through time $T$ starting with one individual. 


HIERARCHICAL AND EMPIRICAL BAYES MULTIVARIATE 
ESTIMATION 


Malay Ghosh*, Department of Statistics, University of Florida 


Abstract 


This article reviews and unifies the hierarchical and empirical Bayes 
approach for estimating the multivariate normal mean. Both the ANOVA and 
the regression models are considered. 


Introduction 


Empirical and hierarchical Bayes methods are becoming increasingly 
popular in statistics, especially in the context of simultaneous estimation of 
several parameters. For example, agencies of the Federal Government have been 
involved in obtaining estimates of per capita income, unemployment rates, crop 
yields and so forth simultaneously for several state and local government areas. 
In such situations, quite often estimates of certain area means, or simultaneous 
estimates of several area means can be improved by incorporating information 
from similar neighboring areas. Examples of this type are especially suitable for 
empirical Bayes (EB) analysis. As described in Berger (1985), an EB scenario is 
one in which known relationships among the coordinates of a parameter vector, 
say $\theta = (\theta_1, \ldots, \theta_p)^T$, allow use of the data to estimate some features of the prior 
distribution. For example, one may have reason to believe that the $\theta_i$'s are iid 
from a prior $\pi_0(\lambda)$, where $\pi_0$ is structurally known except possibly for some 
unknown parameter $\lambda$. A parametric empirical Bayes (EB) procedure is one 
where $\lambda$ is estimated from the marginal distribution of the observations. 

Closely related to the EB procedure is the hierarchical Bayes (HB) 
procedure which models the prior distribution in stages. In the first stage, 
conditional on $\Lambda = \lambda$, the $\theta_i$'s are iid with a prior $\pi_0(\lambda)$. In the second stage, a prior 
distribution (often improper) is assigned to $\Lambda$. This is an example of a two stage 
prior. The idea can be generalized to multistage priors, but that will not be 
pursued in this article. 

It is apparent that both the EB and the HB procedures recognize the 
uncertainty in the prior information, but whereas the HB procedure models the 
uncertainty in the prior information by assigning a distribution (often 
noninformative or improper) to the prior parameters (usually called 
hyperparameters), the EB procedure attempts to estimate the unknown 
hyperparameters, typically by some classical method like the method of 
moments, method of maximum likelihood etc., and use the resulting estimated 
priors for inferential purposes. It turns out that the two methods can quite often 


*This paper is dedicated to Professor D. Basu on the occasion of his 65th birthday. The 
research is partially supported by NSF Grant Numbers DMS 8701814 and DMS 8901334. 




lead to comparable results, especially in the context of point estimation. This 
will be revealed in some of the examples appearing in the later sections. 
However, when it comes to the question of measuring the standard errors associated 
with these estimators, the HB method has a clear edge over a naive EB method. 
Whereas there are no clear-cut measures of standard errors associated with EB 
point estimators, the same is not true with HB estimators. To be precise, if one 
estimates the parameter of interest by its posterior mean, then a very natural 
estimate of the risk associated with this estimator is its posterior variance. 
Estimates of the standard errors associated with EB point estimators usually 
need an ingenious approximation (see, e.g., Morris, 1981, 1983), whereas the 
posterior variances, though often complicated, can be found exactly. 

The above ideas will be made more concrete in the subsequent sections 
with the aid of examples. Ours is an expository article which compares and 
contrasts the EB and the HB methods for multivariate normal linear models. 
The outline of the remaining sections is as follows. In the next section, we 
address the problem of estimating the multivariate normal mean. EB procedures 
for such problems are discussed quite adequately in Efron and Morris (1973), 
Morris (1981, 1983) and Casella (1985). However, the interrelationship between 
the EB and the HB procedures for such problems is not discussed in these papers. 
Lindley and Smith (1972) introduced and provided a detailed discussion of the 
HB approach for estimating the multivariate normal mean. However, there is no 
mention of the EB approach in their paper. 

Deely and Lindley (1981) compared and contrasted the EB and the HB 
procedures much in the spirit of the discussion in the preceding paragraphs. 
However, unlike the present article, they did not emphasize simultaneous 
estimation problems, nor did they incorporate discussion of multivariate normal 
models. 

In the third section, we consider the regression problem. The EB and the 
HB methods are contrasted both for the balanced and unbalanced linear models. 
This section is largely a review of the work of Lindley and Smith (1972) as well 
as Morris (1981, 1983). However, for the unbalanced case, our calculations go 
beyond those of Lindley and Smith (1972). It is our belief that the present 
calculations will shed more light on some of the EB approximations of Morris 
(1983). For the balanced case, the reader is also referred to Berger (1985). 

Extensive development of the EB methodology began with Robbins 
(1951, 1955), who called problems of the above type compound decision problems. 
In Robbins's terminology, an EB procedure is one where $X_1, \ldots, X_{p-1}$ are the past 
data about $\theta_1, \ldots, \theta_{p-1}$. The past data should be used together with the current 
data to infer on a current $\theta_p$. The terminological distinction between the EB and 
compound decision problems will be ignored in this article, and the term 
empirical Bayes will be used to cover problems of both types. Also, Robbins's 
procedure is a nonparametric EB procedure in contrast to the parametric EB 
approach taken in this paper. 

The term hierarchical Bayes was first used by Good (1965). Lindley and 
Smith (1972) called such priors multistage priors. As noted earlier, the latter 
used the idea very effectively for estimating the vector of normal means, as well 
as the vector of regression coefficients. 


Estimation of the Multivariate Normal Mean 


This section is devoted to a comparison of the EB and the HB procedures 
for estimating the multivariate normal mean. We begin with a simple example. 


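I. Conditional on $\theta_1, \ldots, \theta_p$, let $X_1, \ldots, X_p$ be independent with $X_i \sim 
N(\theta_i, \sigma^2)$, $i = 1, \ldots, p$, $\sigma^2$ (> 0) being known. Without loss of generality, 
assume $\sigma^2 = 1$. 


II. The $\theta_i$'s have independent $N(\mu_i, A)$, $i = 1, \ldots, p$, priors. Write $\theta = 
(\theta_1, \ldots, \theta_p)^T$, $X = (X_1, \ldots, X_p)^T$ and $\mu = (\mu_1, \ldots, \mu_p)^T$. 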


The posterior distribution of $\theta$ given $X = x$ is then 
$N((1-B)x + B\mu, (1-B)I_p)$, where $B = (A+1)^{-1}$. Accordingly, the posterior mean 
(the usual Bayes estimate) of $\theta$ is given by 


$$E(\theta \mid X = x) = (1-B)x + B\mu. \tag{1}$$


In an EB or a HB scenario, some or all of the prior parameters are 
unknown. In an EB set up, these parameters are estimated from the marginal 
distribution of $X$, which in this case is $N(\mu, B^{-1}I_p)$. A HB procedure, on the other 
hand, models the uncertainty of the unknown prior parameters by assigning 
distributions to them. Such distributions are often called hyperpriors. We shall 
consider the following three cases. 


Case I. Let $\mu_1 = \cdots = \mu_p = \mu$ (say), where $\mu$ (real) is unknown, but $A$ (> 0) is 
known. Based on the marginal distribution of $X$, $\bar X$ is the UMVUE, MLE and 
the best equivariant estimator of $\mu$. Accordingly, from (1), an EB estimator of $\theta$ 
is given by 

$$\hat\theta_{EB}^{(1)} = (1-B)X + B\bar X\mathbf{1}_p. \tag{2}$$


The estimator given in (2) was proposed by Lindley and Smith (1972). 
They used a HB approach to arrive at the estimator given in (2). The procedure 
is described below. 

Consider the HB model under which (i) conditional on $\theta$ and $\mu$, $X \sim 
N(\theta, I_p)$; (ii) conditional on $\mu$, $\theta \sim N(\mu\mathbf{1}_p, AI_p)$; (iii) $\mu$ is uniform on $(-\infty, \infty)$. 
Then the joint (improper) pdf of $X$, $\theta$ and $\mu$ is given by 


$$f(x, \theta, \mu) \propto \exp\left[-\tfrac{1}{2}\|x - \theta\|^2\right] A^{-\frac{1}{2}p} \exp\left[-\tfrac{1}{2A}\|\theta - \mu\mathbf{1}_p\|^2\right]. \tag{3}$$

The factor $A^{-\frac{1}{2}p}$ could have been left out in (3), but will be needed for later 
calculations. 
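Integrating with respect to $\mu$ in (3), it follows that the joint (improper) 
pdf of $X$ and $\theta$ is 


$$f(x, \theta) \propto \exp\left[-\tfrac{1}{2}\left(\theta^T D\theta - 2\theta^T x + x^T x\right)\right], \tag{4}$$


where $D = A^{-1}[(A+1)I_p - p^{-1}J_p]$ with $J_p = \mathbf{1}_p\mathbf{1}_p^T$. Recall that $B = (A+1)^{-1}$. It 
follows from (4) that the posterior distribution of $\theta$ given $X = x$ is $N(D^{-1}x, D^{-1})$. 


Since $D^{-1} = (1-B)I_p + Bp^{-1}J_p$, one gets 

$$E(\theta \mid X = x) = (1-B)x + B\bar x\mathbf{1}_p; \tag{5}$$
$$V(\theta \mid X = x) = (1-B)I_p + Bp^{-1}J_p. \tag{6}$$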
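
The inverse quoted for $D$ is easy to confirm numerically. The check below is a small sketch using nested lists only; it assumes $D = A^{-1}[(A+1)I_p - p^{-1}J_p]$ with $J_p$ the $p \times p$ matrix of ones and $B = (A+1)^{-1}$, and all function names are illustrative.

```python
def d_matrix(p, A):
    """D = A^{-1} [(A + 1) I_p - p^{-1} J_p], as nested lists."""
    return [[((A + 1.0) * (i == j) - 1.0 / p) / A for j in range(p)]
            for i in range(p)]


def d_inverse_claimed(p, A):
    """Claimed inverse (1 - B) I_p + B p^{-1} J_p with B = 1 / (A + 1)."""
    B = 1.0 / (A + 1.0)
    return [[(1.0 - B) * (i == j) + B / p for j in range(p)]
            for i in range(p)]


def matmul(X, Y):
    """Plain triple-loop product for small square matrices."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def max_deviation_from_identity(p, A):
    """Largest absolute entry of D * (claimed inverse) - I_p."""
    prod = matmul(d_matrix(p, A), d_inverse_claimed(p, A))
    return max(abs(prod[i][j] - (i == j))
               for i in range(p) for j in range(p))


if __name__ == "__main__":
    print(max_deviation_from_identity(4, 2.5))
```

The deviation is at rounding-error level for any $p$ and $A > 0$, which reflects the decomposition of $D$ into a multiple of $I_p - p^{-1}J_p$ plus the projection $p^{-1}J_p$.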


A naive EB approach as noted earlier uses the estimated posterior 
distribution $N((1-B)x + B\bar x\mathbf{1}_p, (1-B)I_p)$ to infer about $\theta$. A comparison of (2) 
and (5) reveals that the EB and the HB approaches yield the same point estimate 
of $\theta$, but the naive EB approach estimates the posterior variance by $(1-B)I_p$, 
which is an underestimate when compared to (6). This point is discussed more 
fully below. 

Based on (3), the posterior distribution of $\theta$ given $x$ and $\mu$ is 
$N((1-B)x + B\mu\mathbf{1}_p, (1-B)I_p)$. Also, integrating with respect to $\theta$ in (3), it follows 
that the joint (improper) distribution of $x$ and $\mu$ is given by 


$$f(x, \mu) \propto B^{\frac{1}{2}p} \exp\left[-\tfrac{1}{2}B\|x - \mu\mathbf{1}_p\|^2\right]. \tag{7}$$


It follows from (7) that the posterior distribution of $\mu$ given $X = x$ is 
$N(\bar x, (Bp)^{-1})$. Hence, one may note that 


$$(1-B)I_p = E[V(\theta \mid X, \mu) \mid X]; \tag{8}$$
$$Bp^{-1}J_p = V[B\mu\mathbf{1}_p \mid X] = V[(1-B)X + B\mu\mathbf{1}_p \mid X] = V[E(\theta \mid X, \mu) \mid X]. \tag{9}$$


Thus a naive EB procedure ignores estimating $V[E(\theta \mid X, \mu) \mid X]$, which amounts to 
ignoring the uncertainty involved in estimating the prior parameters when 
estimating the posterior variance. 

It is shown in Lindley and Smith (1972) that the risk of $\hat\theta_{EB}^{(1)}$ is not 
uniformly smaller than that of $X$ under the squared error loss $L(\theta, a) = \|\theta - a\|^2$. 
However, there is a Bayes risk superiority of $\hat\theta_{EB}^{(1)}$ over $X$ which is described below. 


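Theorem 1 
Consider the model $X \mid \theta \sim N(\theta, I_p)$ and the prior $\theta \sim N(\mu\mathbf{1}_p, AI_p)$. Let 
$E$ denote expectation over the joint distribution of $X$ and $\theta$. Then, under the 
matrix loss $L_1(\theta, a) = (a-\theta)(a-\theta)^T$, and writing $\hat\theta_B$ as the Bayes estimator of $\theta$ 
under $L_1$, 

$$E L_1(\theta, \hat\theta_B) = (1-B)I_p; \quad E L_1(\theta, \hat\theta_{EB}^{(1)}) = (1-B)I_p + Bp^{-1}J_p. \tag{10}$$


Next assuming the quadratic loss $L_2(\theta, a) = (a-\theta)^T Q(a-\theta)$, where $Q$ is a known 
non-negative definite (n.n.d.) weight matrix, 


$$E L_2(\theta, X) = \mathrm{tr}(Q), \quad E L_2(\theta, \hat\theta_B) = (1-B)\,\mathrm{tr}(Q); \tag{11}$$
$$E L_2(\theta, \hat\theta_{EB}^{(1)}) = (1-B)\,\mathrm{tr}(Q) + B\,\mathrm{tr}(Qp^{-1}J_p). \tag{12}$$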


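Proof. Note that $\hat\theta_B = (1-B)X + B\mu\mathbf{1}_p$. It is immediate that $E L_1(\theta, X) = 
E[(X-\theta)(X-\theta)^T] = E(I_p) = I_p$ and $E L_1(\theta, \hat\theta_B) = E[V(\theta \mid X)] = E[(1-B)I_p] = 
(1-B)I_p$. Also, since marginally $\bar X \sim N(\mu, (Bp)^{-1})$, 


$$E L_1(\theta, \hat\theta_{EB}^{(1)}) = E L_1(\theta, \hat\theta_B) + E[(\hat\theta_B - \hat\theta_{EB}^{(1)})(\hat\theta_B - \hat\theta_{EB}^{(1)})^T]$$
$$= (1-B)I_p + B^2 E[(\bar X - \mu)^2\mathbf{1}_p\mathbf{1}_p^T]$$
$$= (1-B)I_p + Bp^{-1}J_p.$$


This completes the proof of (10). To prove (11) and (12), write $L_2(\theta, a) = 
(\theta-a)^T Q(\theta-a) = \mathrm{tr}(QL_1(\theta, a))$ and use (10). 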


Remark 1. It follows from (10)-(12) that $E[L_i(\theta, X) - L_i(\theta, \hat\theta_{EB}^{(1)})]$ is 
nonnegative definite for each $i = 1, 2$. Accordingly, $\hat\theta_{EB}^{(1)}$ has smaller Bayes risk 
than that of $X$ both under the matrix loss $L_1$, and a fortiori the quadratic loss 
$L_2$. To our knowledge, this particular optimality of the Lindley-Smith estimator 
has not been pointed out before. 

The perfect agreement between the EB and the HB point estimators of $\theta$ 
in Case I is an exception rather than the rule. We now consider cases II and III 
which reveal that the point estimators of $\theta$ can also differ under the two 
approaches. 

Case II. Assume that $\mu$ is known, but its components need not be all equal. 
Moreover, this time $A$ is unknown. The marginal distribution of $X$ is 
$N(\mu, B^{-1}I_p)$. Then $\|X-\mu\|^2$ is complete sufficient, and is distributed as 
$B^{-1}\chi_p^2$. Accordingly, for $p \ge 3$, the UMVUE of $B$ is given by $(p-2)/\|X-\mu\|^2$. 
Substituting this estimator of $B$ in (1), an EB estimator of $\theta$ is given by 


$$\hat\theta_{EB}^{(2)} = \left(1 - \frac{p-2}{\|X-\mu\|^2}\right)X + \frac{p-2}{\|X-\mu\|^2}\,\mu = X - \frac{p-2}{\|X-\mu\|^2}\,(X-\mu). \tag{13}$$


This is the celebrated James-Stein estimator (James and Stein, 1961). The EB 
interpretation of this estimator was given in a series of articles by Efron and 
Morris (1972, 1973, 1975). The most popular version of this estimator takes 
$\mu = 0$. 
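
For concreteness, the James-Stein estimator that shrinks $X$ towards a known point $\mu$ by the data-based factor $(p-2)/\|X-\mu\|^2$ can be sketched as follows. This is a minimal illustration with unit sampling variance; the function name and the default $\mu = 0$ are choices made here, and no guard is included for the probability-zero event $X = \mu$.

```python
def james_stein(x, mu=None):
    """James-Stein estimate shrinking x towards mu (default: the origin).

    Assumes unit-variance normal observations and p >= 3.
    """
    p = len(x)
    if p < 3:
        raise ValueError("the James-Stein estimator needs p >= 3")
    if mu is None:
        mu = [0.0] * p
    resid = [xi - mi for xi, mi in zip(x, mu)]
    sq_dist = sum(r * r for r in resid)           # ||x - mu||^2
    b_hat = (p - 2) / sq_dist                     # unbiased estimate of B
    return [xi - b_hat * r for xi, r in zip(x, resid)]


if __name__ == "__main__":
    # ||x||^2 = 4 and p = 4, so the shrinkage factor is 0.5.
    print(james_stein([2.0, 0.0, 0.0, 0.0]))
```

Note that `b_hat` can exceed 1 when `sq_dist` is small, in which case the estimate overshoots $\mu$.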
It is shown in James and Stein that for $p \ge 3$, the risk of $\hat\theta_{EB}^{(2)}$ is smaller 
than that of $X$ under the squared error loss. However, if the loss is changed to 
the arbitrary quadratic loss $L_2$ of Theorem 1, then the risk dominance of $\hat\theta_{EB}^{(2)}$ 
over $X$ does not necessarily hold. Indeed, it is well-known (see, e.g., Bock, 
1975, or Berger, 1975) that under the loss $L_2$, $\hat\theta_{EB}^{(2)}$ dominates $X$ if (i) $\mathrm{tr}(Q) > 
2\,\mathrm{ch}_1(Q)$ and (ii) $0 < p-2 < 2[\mathrm{tr}(Q)/\mathrm{ch}_1(Q) - 2]$, where $\mathrm{ch}_1(Q)$ denotes the 
largest eigenvalue of $Q$. 


The Bayes risk of $\hat\theta_{EB}^{(2)}$ is, however, smaller than that of $X$ under the 
losses $L_1$ and $L_2$, the model given in Theorem 1, and the prior $N(\mu, AI_p)$. As 
before, let $E$ denote expectation over the joint distribution of $X$ and $\theta$. The 
following theorem is proved. 


Theorem 2 
Let $X \mid \theta \sim N(\theta, I_p)$ and $\theta \sim N(\mu, AI_p)$. Then for $p \ge 3$, 

$$E[L_1(\theta, \hat\theta_{EB}^{(2)})] = [1 - B(p-2)p^{-1}]\,I_p; \tag{14}$$
$$E[L_2(\theta, \hat\theta_{EB}^{(2)})] = [1 - B(p-2)p^{-1}]\,\mathrm{tr}(Q). \tag{15}$$

Proof. To prove (14), use the identity 


$$E[L_1(\theta, \hat\theta_{EB}^{(2)})] = E[L_1(\theta, \hat\theta_B)] + E[(\hat\theta_B - \hat\theta_{EB}^{(2)})(\hat\theta_B - \hat\theta_{EB}^{(2)})^T]. \tag{16}$$

Next write 

$$E[(\hat\theta_B - \hat\theta_{EB}^{(2)})(\hat\theta_B - \hat\theta_{EB}^{(2)})^T] = E\left[\left(B - \frac{p-2}{\|X-\mu\|^2}\right)^2 (X-\mu)(X-\mu)^T\right]. \tag{17}$$


Marginally, $X \sim N(\mu, B^{-1}I_p)$. Hence, $\|X-\mu\|^2$ is complete sufficient, while 
$(X-\mu)(X-\mu)^T/\|X-\mu\|^2 = B(X-\mu)(X-\mu)^T/(B\|X-\mu\|^2)$ is ancillary. 
Hence, using Basu's Theorem (see Basu, 1955), $(X-\mu)(X-\mu)^T/\|X-\mu\|^2$ is 
distributed independently of $\|X-\mu\|^2$. Hence, 

$$B^{-1}I_p = E[(X-\mu)(X-\mu)^T]$$
$$= E[\|X-\mu\|^2 \{(X-\mu)(X-\mu)^T/\|X-\mu\|^2\}]$$
$$= E(\|X-\mu\|^2)\, E[(X-\mu)(X-\mu)^T/\|X-\mu\|^2].$$

Now using $E\|X-\mu\|^2 = B^{-1}p$, $E\|X-\mu\|^{-2} = B(p-2)^{-1}$ for $p \ge 3$, one gets 

$$E[(X-\mu)(X-\mu)^T/\|X-\mu\|^2] = p^{-1}I_p; \tag{18}$$

$$E[(X-\mu)(X-\mu)^T/\|X-\mu\|^4] = E[(X-\mu)(X-\mu)^T/\|X-\mu\|^2]\, E(\|X-\mu\|^{-2}) = (p^{-1}I_p)\,B(p-2)^{-1}. \tag{19}$$

It follows from (17)-(19) that 

$$E[(\hat\theta_B - \hat\theta_{EB}^{(2)})(\hat\theta_B - \hat\theta_{EB}^{(2)})^T] = B^2(B^{-1}I_p) - 2B(p-2)p^{-1}I_p + B(p-2)p^{-1}I_p = 2Bp^{-1}I_p. \tag{20}$$

Combining (10), (16) and (20), one gets (14). The proof of (15) is immediate 
from (14) by writing $L_2(\theta, a) = \mathrm{tr}[QL_1(\theta, a)]$. 


Remark 2. Taking $Q$ as a matrix with its $(i, i)$th element equal to 1 and the rest 
zeroes, it follows that the $i$th component of $\hat\theta_{EB}^{(2)}$ dominates $X_i$ when one compares 
their Bayes risks. This co-ordinatewise Bayes risk dominance of $\hat\theta_{EB}^{(2)}$ over $X$ 
appears in Efron and Morris (1973). One can derive (15) from their work by 
using an orthogonal transformation. The dominance of $\hat\theta_{EB}^{(2)}$ over $X$ under the 
matrix loss $L_1$ has not been pointed out before, but the approach appears in 
Reinsel (1985) for a more complex EB problem. 


Remark 3. Efron and Morris (1973) found it convenient to define the concept of 
relative savings loss (RSL). Denote the given prior by $\xi$ and the Bayes risk of an 
estimator $e$ of $\theta$ under the prior $\xi$ and the loss $L_2$ by $r(\xi, e)$. The RSL of $\hat\theta_{EB}^{(2)}$ 
with respect to $X$ is defined by 


$$\mathrm{RSL}(\hat\theta_{EB}^{(2)}; X) = [r(\xi, \hat\theta_{EB}^{(2)}) - r(\xi, \hat\theta_B)]/[r(\xi, X) - r(\xi, \hat\theta_B)]$$
$$= 1 - [r(\xi, X) - r(\xi, \hat\theta_{EB}^{(2)})]/[r(\xi, X) - r(\xi, \hat\theta_B)]. \tag{21}$$


This is the proportion of the possible Bayes risk improvement over $X$ that is 
sacrificed by using $\hat\theta_{EB}^{(2)}$ rather than the ideal estimator $\hat\theta_B$ under the prior $\xi$. It 
follows from (11), (15) and (16) that $\mathrm{RSL}(\hat\theta_{EB}^{(2)}; X) = 2/p$ for an arbitrary n.n.d. 
non-null matrix $Q$. Efron and Morris (1973) proved the result when $Q = I_p$ as 
well as when the $(i, i)$th element of $Q$ is 1 and the rest zeroes ($i = 1, \ldots, p$). For 
the matrix loss $L_1$, the RSL concept of Efron and Morris (1973) can be 
generalized to get 

$$\mathrm{RSL}(\hat\theta_{EB}^{(2)}; X) = [r(\xi, X) - r(\xi, \hat\theta_B)]^{-1}[r(\xi, \hat\theta_{EB}^{(2)}) - r(\xi, \hat\theta_B)]$$
$$= (BI_p)^{-1}(B(2/p))I_p = (2/p)I_p. \tag{22}$$

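
The value $2/p$ can be recovered directly from the three Bayes risks involved. The sketch below assumes the quadratic-loss risks $\mathrm{tr}(Q)$ for $X$, $(1-B)\mathrm{tr}(Q)$ for the Bayes estimator and $[1 - B(p-2)/p]\mathrm{tr}(Q)$ for the James-Stein estimator; the function and key names are illustrative.

```python
def bayes_risks(p, A, tr_q=1.0):
    """Bayes risks under quadratic loss with weight matrix trace tr_q."""
    B = 1.0 / (A + 1.0)
    return {
        "X": tr_q,                                   # unbiased estimator X
        "bayes": (1.0 - B) * tr_q,                   # known-prior Bayes rule
        "james_stein": (1.0 - B * (p - 2.0) / p) * tr_q,
    }


def relative_savings_loss(p, A, tr_q=1.0):
    """Share of the attainable improvement over X that James-Stein gives up."""
    r = bayes_risks(p, A, tr_q)
    return (r["james_stein"] - r["bayes"]) / (r["X"] - r["bayes"])


if __name__ == "__main__":
    for p in (3, 5, 10, 50):
        print(p, relative_savings_loss(p, A=2.0))
```

The ratio depends on neither $A$ nor $\mathrm{tr}(Q)$, so the price of estimating $B$ from the data is a fixed fraction $2/p$ of the attainable savings.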
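Suppose now we consider a HB approach in this case, where conditional 
on $\theta$ and $A$, $X \sim N(\theta, I_p)$, and conditional on $A$, $\theta \sim N(\mu, AI_p)$. Also, let $A$ 
have marginal pdf $g_0(A)$. Then, the joint pdf of $X$, $\theta$ and $A$ is 


$$f(x, \theta, A) \propto \exp\left[-\tfrac{1}{2}\|x - \theta\|^2\right] A^{-\frac{1}{2}p} \exp\left[-\tfrac{1}{2A}\|\theta - \mu\|^2\right] g_0(A). \tag{23}$$


As before, the conditional distribution of $\theta$ given $x$ and $A$ is 
$N((1-B)x + B\mu, (1-B)I_p)$, where $B = (A+1)^{-1}$. But integrating with respect to 
$\theta$, the joint pdf of $X$ and $A$ is 


$$f(x, A) \propto (A+1)^{-\frac{1}{2}p} \exp\left[-\tfrac{1}{2(A+1)}\|x - \mu\|^2\right] g_0(A). \tag{24}$$

Since $B = (A+1)^{-1}$, the joint pdf of $X$ and $B$ is of the form 

$$f(x, B) \propto B^{\frac{1}{2}p} \exp\left[-\tfrac{1}{2}B\|x - \mu\|^2\right] g(B), \tag{25}$$

where $g$ denotes the pdf of $B$ induced by $g_0$. 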


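The HB approach of the above type was first proposed by Strawderman 
(1971), and was later generalized by Faith (1978). Assuming the Type II Beta 
density for $A$, namely $g_0(A) \propto A^{m-1}(1+A)^{-(m+n)}$, where $m$ (> 0) and $n$ (> 0), 
it is easy to see that 


$$f(x, B) \propto B^{\frac{1}{2}p+n-1}(1-B)^{m-1} \exp\left[-\tfrac{1}{2}B\|x - \mu\|^2\right]. \tag{26}$$


Now, using the iterated formula for conditional expectations, 


$$E(\theta \mid x) = E[E(\theta \mid B, x) \mid x] = (1 - \hat B)x + \hat B\mu, \tag{27}$$

where 


$$\hat B = E(B \mid x) = \int_0^1 B^{\frac{1}{2}p+n}(1-B)^{m-1} \exp\left[-\tfrac{1}{2}B\|x - \mu\|^2\right] dB \Big/ \int_0^1 B^{\frac{1}{2}p+n-1}(1-B)^{m-1} \exp\left[-\tfrac{1}{2}B\|x - \mu\|^2\right] dB. \tag{28}$$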
, E 


Strawderman (1971) considered the case $m = 1$, and found sufficient conditions 
on $n$ under which the risk of $\hat\theta_{HB}^{(2)}$ is smaller than that of $X$. His results were 
generalized to a certain extent by Faith (1978). 
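
The posterior mean of $B$ has no closed form in general, but the ratio of integrals is easy to approximate. The following is a rough sketch using a midpoint rule; it assumes the posterior of $B$ is proportional to $B^{p/2+n-1}(1-B)^{m-1}\exp[-\frac{1}{2}B\|x-\mu\|^2]$ on (0, 1), takes $m \ge 1$ so that the integrand stays bounded near $B = 1$, and uses an arbitrary grid size. The function name and arguments are hypothetical.

```python
from math import exp


def hb_shrinkage_factor(sq_dist, p, m=1.0, n=1.0, grid=20000):
    """Midpoint-rule approximation to the posterior mean E(B | x).

    sq_dist : ||x - mu||^2
    p       : dimension
    m, n    : parameters of the Type II Beta prior on A (m >= 1 assumed)
    """
    a = 0.5 * p + n - 1.0            # exponent of B in the posterior density
    h = 1.0 / grid
    num = den = 0.0
    for k in range(grid):
        b = (k + 0.5) * h            # midpoint of the k-th subinterval
        w = b ** a * (1.0 - b) ** (m - 1.0) * exp(-0.5 * b * sq_dist)
        num += b * w
        den += w
    return num / den


if __name__ == "__main__":
    for s in (0.0, 5.0, 50.0):
        print(s, hb_shrinkage_factor(s, p=4))
```

Unlike the unbiased estimate of $B$, this factor always lies strictly between 0 and 1, and it decreases as $\|x-\mu\|^2$ grows.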

We consider also the case $m = 1$, and interpreting (26) as the posterior 
pdf of $B$ given $x$, find the posterior mode of $B$ as 


$$\hat B_{MO} = \min\left((p+2n-2)/\|x - \mu\|^2,\ 1\right). \tag{29}$$


Substituting this estimator of $B$ in (1), one gets the estimator 


$$\hat\theta_{MO}^{(2)} = (1 - \hat B_{MO})X + \hat B_{MO}\,\mu = X - \hat B_{MO}(X - \mu) \tag{30}$$


of $\theta$. The special choice $n = 0$ leads to the positive part James-Stein estimator 
which is known to dominate the usual James-Stein estimator (see Lehmann, 
1983, p. 302). This is intuitively very clear since the usual James-Stein estimator 
substitutes the UMVUE of $B$ in (1), and this UMVUE can take values exceeding 
1 with positive probability while $0 < B < 1$. This deficiency is rectified by 
$\hat B_{MO}$. 
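
The truncation at 1 is the only change relative to the plain James-Stein rule, as the short sketch below illustrates. It assumes the posterior-mode shrinkage factor $\min((p+2n-2)/\|x-\mu\|^2, 1)$; the function name is hypothetical, and $n = 0$ is used as the default to obtain the positive part James-Stein estimator.

```python
def posterior_mode_estimator(x, mu, n=0.0):
    """Shrink x towards mu with the factor min((p + 2n - 2)/||x - mu||^2, 1).

    With n = 0 this is the positive part James-Stein estimator.
    Assumes x differs from mu, so the squared distance is positive.
    """
    p = len(x)
    resid = [xi - mi for xi, mi in zip(x, mu)]
    sq_dist = sum(r * r for r in resid)
    b_mode = min((p + 2.0 * n - 2.0) / sq_dist, 1.0)   # never exceeds 1
    return [xi - b_mode * r for xi, r in zip(x, resid)]


if __name__ == "__main__":
    zero = [0.0, 0.0, 0.0, 0.0]
    # Far from mu: the factor is 0.5, as for the usual James-Stein rule.
    print(posterior_mode_estimator([2.0, 0.0, 0.0, 0.0], zero))
    # Close to mu: the factor is capped at 1, so the estimate equals mu.
    print(posterior_mode_estimator([1.0, 0.0, 0.0, 0.0], zero))
```

In the second call the uncapped factor would be 2, and the plain James-Stein rule would flip the sign of the first coordinate; the cap returns $\mu$ instead.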
Case III. The model is similar to the one in Case I, except that now $\mu$ (real) and 
$A$ (> 0) are both unknown. Note that marginally $X \sim N(\mu\mathbf{1}_p, B^{-1}I_p)$, where $B 
= (A+1)^{-1}$. Hence, $(\bar X, \sum_{i=1}^p (X_i - \bar X)^2)$ is complete sufficient, so that the UMVUE's 
of $\mu$ and $B$ are given respectively by $\bar X$ and $(p-3)/\sum_{i=1}^p (X_i - \bar X)^2$. Substituting these 
estimators of $\mu$ and $B$ in (1), the EB estimator of $\theta$ is given by 


$$\hat\theta_{EB}^{(3)} = X - \frac{p-3}{\sum_{i=1}^p (X_i - \bar X)^2}\,(X - \bar X\mathbf{1}_p). \tag{31}$$


This modification of the James-Stein estimator was proposed by Lindley (1962). 
Whereas the original James-Stein estimator shrinks $X$ towards a specified point, 
the modified estimator given in (31) shrinks $X$ towards a hyperplane spanned by 
$\mathbf{1}_p$. 

The estimator $\hat\theta_{EB}^{(3)}$ is known to dominate $X$ for $p \ge 4$. Its Bayes risk 
under the $L_1$ and $L_2$ losses are not known however. We now prove a theorem to 
this effect quite in the spirit of Theorems 1 and 2. 
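Theorem 3 
Assume the model and the prior given in Theorem 1. Then, for $p \ge 4$, 

$$E[L_1(\theta, \hat\theta_{EB}^{(3)})] = I_p - B(p-3)(p-1)^{-1}(I_p - p^{-1}J_p); \tag{32}$$
$$E[L_2(\theta, \hat\theta_{EB}^{(3)})] = \mathrm{tr}(Q) - B(p-3)(p-1)^{-1}\,\mathrm{tr}[Q(I_p - p^{-1}J_p)]. \tag{33}$$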


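Proof. First write 


$$E[L_1(\theta, \hat\theta_{EB}^{(3)})] = E[L_1(\theta, \hat\theta_B)] + E[(\hat\theta_{EB}^{(3)} - \hat\theta_B)(\hat\theta_{EB}^{(3)} - \hat\theta_B)^T]. \tag{34}$$


We write 


$$\hat\theta_{EB}^{(3)} - \hat\theta_B = \left(B - \frac{p-3}{\sum_{i=1}^p (X_i - \bar X)^2}\right)(X - \bar X\mathbf{1}_p) + B(\bar X - \mu)\mathbf{1}_p. \tag{35}$$


Now using the independence of $X - \bar X\mathbf{1}_p$ and $\bar X$, and using the fact that $\bar X \sim 
N(\mu, (Bp)^{-1})$, one gets from (35), 

$$E[(\hat\theta_{EB}^{(3)} - \hat\theta_B)(\hat\theta_{EB}^{(3)} - \hat\theta_B)^T] = E\left[\left(B - \frac{p-3}{\sum_{i=1}^p (X_i - \bar X)^2}\right)^2 (X - \bar X\mathbf{1}_p)(X - \bar X\mathbf{1}_p)^T\right] + B^2(Bp)^{-1}J_p. \tag{36}$$


Next using the independence of $(X - \bar X\mathbf{1}_p)(X - \bar X\mathbf{1}_p)^T / \sum_{i=1}^p (X_i - \bar X)^2$ with 
$\sum_{i=1}^p (X_i - \bar X)^2$ (again by applying Basu's Theorem) and the facts that 
$E[(X - \bar X\mathbf{1}_p)(X - \bar X\mathbf{1}_p)^T] = B^{-1}(I_p - p^{-1}J_p)$, while $\sum_{i=1}^p (X_i - \bar X)^2 \sim B^{-1}\chi_{p-1}^2$, it follows 
from (36) that for $p \ge 4$, 

$$E[(\hat\theta_{EB}^{(3)} - \hat\theta_B)(\hat\theta_{EB}^{(3)} - \hat\theta_B)^T]$$
$$= B^2 B^{-1}(I_p - p^{-1}J_p) - 2B(p-3)(p-1)^{-1}(I_p - p^{-1}J_p)$$
$$\quad + (p-3)^2 B(p-3)^{-1}(p-1)^{-1}(I_p - p^{-1}J_p) + Bp^{-1}J_p$$
$$= BI_p - B(p-3)(p-1)^{-1}(I_p - p^{-1}J_p). \tag{37}$$


Combining (10), (34) and (37), one gets (32). The proof of (33) is immediate 
from (32). 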

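We now proceed to find the HB estimator of $\theta$. Consider the model
where (i) conditional on $\theta$, $\mu$ and $A$, $X \sim N(\theta, I_p)$; (ii) conditional on $\mu$ and $A$,
$\theta \sim N(\mu 1_p, AI_p)$; (iii) marginally $\mu$ and $A$ are independently distributed with $\mu$
uniform on $(-\infty, \infty)$, and $A$ has uniform improper pdf on $(0, \infty)$. Then the joint
(improper) pdf of $X$, $\theta$, $\mu$ and $A$ is given by

$$f(x, \theta, \mu, A) \propto \exp[-\tfrac12\|x-\theta\|^2]\, A^{-\frac12 p} \exp\Big[-\frac{1}{2A}\|\theta - \mu 1_p\|^2\Big]. \qquad (38)$$

Now integrating with respect to $\mu$, it follows from (38) that the joint (improper)
pdf of $X$, $\theta$ and $A$ is

$$f(x, \theta, A) \propto A^{-\frac12(p-1)} \exp\Big[-\tfrac12(\theta - D^{-1}x)^T D(\theta - D^{-1}x) - \frac{1}{2(A+1)}\sum_{i=1}^p (x_i-\bar{x})^2\Big], \qquad (39)$$

where $D$ is defined after (6). Recall $D^{-1} = (1-B)I_p + Bp^{-1}J_p$. Hence,
conditional on $x$ and $A$, $\theta \sim N[(1-B)x + B\bar{x}1_p,\ (1-B)I_p + Bp^{-1}J_p]$. Also,
integrating with respect to $\theta$ in (39), one gets the joint pdf of $X$ and $A$ given by

$$f(x, A) \propto (A+1)^{-\frac12(p-1)} \exp\Big[-\frac{1}{2(A+1)}\sum_{i=1}^p (x_i-\bar{x})^2\Big]. \qquad (40)$$

Since $B = (A+1)^{-1}$, it follows from (40) that the joint pdf of $X$ and $B$ is given by

$$f(x, B) \propto B^{\frac12(p-1)} \exp\Big[-\tfrac12 B\sum_{i=1}^p (x_i-\bar{x})^2\Big] B^{-2} = B^{\frac12(p-5)} \exp\Big[-\tfrac12 B\sum_{i=1}^p (x_i-\bar{x})^2\Big]. \qquad (41)$$

It follows from (41) that

$$E(B|x) = \int_0^1 B^{\frac12(p-3)} \exp\Big[-\tfrac12 B\sum_{i=1}^p (x_i-\bar{x})^2\Big] dB \Big/ \int_0^1 B^{\frac12(p-5)} \exp\Big[-\tfrac12 B\sum_{i=1}^p (x_i-\bar{x})^2\Big] dB; \qquad (42)$$

$$E(B^2|x) = \int_0^1 B^{\frac12(p-1)} \exp\Big[-\tfrac12 B\sum_{i=1}^p (x_i-\bar{x})^2\Big] dB \Big/ \int_0^1 B^{\frac12(p-5)} \exp\Big[-\tfrac12 B\sum_{i=1}^p (x_i-\bar{x})^2\Big] dB. \qquad (43)$$

One can obtain $V(B|x)$ from (42) and (43), and use these to obtain

$$E(\theta|x) = x - E(B|x)(x - \bar{x}1_p); \qquad (44)$$

$$V(\theta|x) = V[E(\theta|B, x)|x] + E[V(\theta|B, x)|x]$$
$$= V[x - B(x - \bar{x}1_p)|x] + E[(1-B)I_p + Bp^{-1}J_p\,|\,x]$$
$$= V(B|x)(x - \bar{x}1_p)(x - \bar{x}1_p)^T + I_p - E(B|x)(I_p - p^{-1}J_p). \qquad (45)$$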
Also, one can obtain a positive-part version of Lindley's estimator by substituting
the posterior mode of $B$, namely $\min\{(p-5)/\sum_{i=1}^p (X_i-\bar{X})^2,\ 1\}$, in (1). Morris
(1981) suggested approximations to $E(B|x)$ and $E(B^2|x)$ involving replacement of
$\int_0^1$ by $\int_0^\infty$ both in the numerator and in the denominator of (42) and (43).
The resulting approximations turn out to be $E(B|x) \doteq (p-3)/\sum_{i=1}^p (x_i-\bar{x})^2$ and
$E(B^2|x) \doteq (p-1)(p-3)/\{\sum_{i=1}^p (x_i-\bar{x})^2\}^2$, so that $V(B|x) \doteq 2(p-3)/\{\sum_{i=1}^p (x_i-\bar{x})^2\}^2$.
Morris (1981) points out that the above approximations amount to putting a
uniform prior on $A$ on $(-1, \infty)$ rather than on $(0, \infty)$. Note that with Morris's
approximations

$$E(\theta|X) \doteq X - \frac{p-3}{\sum_{i=1}^p (X_i-\bar{X})^2}(X - \bar{X}1_p) = \hat{\theta}_{EB}, \qquad (46)$$

which is Lindley's modification of the James-Stein estimator, while

$$V(\theta|X) \doteq \frac{2(p-3)}{\{\sum_{i=1}^p (X_i-\bar{X})^2\}^2}(X - \bar{X}1_p)(X - \bar{X}1_p)^T + I_p - \frac{p-3}{\sum_{i=1}^p (X_i-\bar{X})^2}(I_p - p^{-1}J_p). \qquad (47)$$

Morris (1981) considered a slightly more general version of the model where
conditional on $\theta$, $\mu$ and $A$, $X \sim N(\theta, \sigma^2 I_p)$, while the distributions of $\theta$, $\mu$ and $A$
remain the same. If one redefines $B = \sigma^2/(\sigma^2+A)$, the only change that is
needed in the analysis is that conditional on $x$ and $A$,
$\theta \sim N((1-B)x + B\bar{x}1_p,\ \sigma^2[(1-B)I_p + Bp^{-1}J_p])$, while the conditional pdf of $B$
given $x$, and accordingly $E(B|x)$ and $V(B|x)$, are modified by putting $B/\sigma^2$ in
place of $B$ in the exponents.
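The ratios in (42) and (43) have no closed form, but they are easy to evaluate numerically. The sketch below is only an illustration under our own choices: it uses Simpson's rule after the substitution $B = t^2$ (which removes the endpoint singularity when p = 4), takes `s` to be $\sum (x_i-\bar{x})^2$ with $\sigma^2 = 1$, and compares the result with Morris's approximations.

```python
import math


def _integral(power, s, steps=2000):
    """Simpson's rule for int_0^1 B**power * exp(-B*s/2) dB, via B = t**2."""
    def g(t):
        return 2.0 * t ** (2.0 * power + 1.0) * math.exp(-0.5 * s * t * t)

    h = 1.0 / steps  # steps must be even for Simpson's rule
    total = g(0.0) + g(1.0)
    for k in range(1, steps):
        total += (4.0 if k % 2 else 2.0) * g(k * h)
    return total * h / 3.0


def exact_moments(p, s):
    """Posterior mean and variance of B from (42) and (43); needs p >= 4."""
    den = _integral(0.5 * (p - 5), s)
    m1 = _integral(0.5 * (p - 3), s) / den
    m2 = _integral(0.5 * (p - 1), s) / den
    return m1, m2 - m1 * m1


def morris_moments(p, s):
    """Morris's approximations: integrate over (0, infinity) instead of (0, 1)."""
    return (p - 3.0) / s, 2.0 * (p - 3.0) / (s * s)
```

When `s` is large the posterior of $B$ puts almost no mass near 1 and the two sets of moments nearly coincide. When `s` is small the approximate mean can exceed 1, while the exact posterior mean always stays inside (0, 1).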

We now revisit the famous baseball data of Efron and Morris (1975). 
They considered the batting averages of 18 baseball players in 1970 after each 
had batted 45 times. Based on these batting averages, they estimated (in fact, 
predicted) the players’ batting averages for the remainder of the season. We used 
formulas (42) and (43) with $B/\sigma^2$ replacing $B$ in the exponents to get the exact
expressions for $E(\theta_i|x)$ and $V(\theta_i|x)$. Also, we used Morris's approximations which
are obtained by modifying (46) and (47). The results are given in Table 1. In 
TABLE 1. The True Values ($\theta_i$), the Maximum Likelihood Estimates ($Y_i$), the
Hierarchical Bayes Estimates ($\hat{\theta}_{i,HB}$), the Hierarchical Bayes S.D.'s ($s_{i,HB}$),
Morris's Approximate Estimates ($\hat{\theta}_{i,M}$), and Morris's Approximate S.D.'s ($s_{i,M}$)


$i$  $\theta_i$  $Y_i$  $\hat{\theta}_{i,HB}$  $s_{i,HB}$  $[\hat{\theta}_{i,HB} \pm 2s_{i,HB}]$  $\hat{\theta}_{i,M}$  $s_{i,M}$  $[\hat{\theta}_{i,M} \pm 2s_{i,M}]$

1 0.346 0.395 0.308 0.046 [0.216,0.400] 0.293 0.073 [0.147,0.439] 
2 0.300 0.375 0.301 0.044 [0.213,0.389] 0.288 0.071 [0.142,0.430] 
3 0.279 0.355 0.295 0.043 [0.209,0.381] 0.284 0.069 [0.146,0.422] 
4 0.223 0.334 0.288 0.042 [0.204,0.372] 0.280 0.067 [0.146,0.414] 
5 0.276 0.313 0.281 0.041 [0.199,0.363] 0.275 0.066 [0.143,0.407] 
6 0.273 0.291 0.281 0.041 [0.199,0.363] 0.275 0.066 [0.143,0.407] 
7 0.266 0.269 0.274 0.040 [0.194,0.354] 0.271 0.066 [0.139,0.405]
8 0.211 0.247 0.267 0.040 [0.187,0.347] 0.266 0.066 [0.134,0.398] 
9 0.271 0.247 0.260 0.040 [0.180,0.340] 0.262 0.067 [0.128,0.396] 
10 0.232 0.247 0.260 0.040 [0.180,0.340] 0.262 0.067 [0.128,0.396] 
11 0.266 0.224 0.252 0.040 [0.172,0.332] 0.257 0.068 [0.121,0.393] 
12 0.258 0.224 0.252 0.040 [0.172,0.332] 0.257 0.068 [0.121,0.393] 
13 0.306 0.224 0.252 0.040 [0.172,0.332] 0.257 0.068 [0.121,0.393] 
14 0.267 0.224 0.252 0.040 [0.172,0.332] 0.257 0.068 [0.121,0.393] 
15 0.228 0.224 0.252 0.040 [0.172,0.332] 0.257 0.068 [0.121,0.393] 
16 0.288 0.200 0.244 0.041 [0.162,0.326] 0.252 0.070 [0.112,0.392] 
17 0.318 0.175 0.236 0.043 [0.150,0.322] 0.247 0.073 [0.101,0.393] 
18 0.200 0.148 0.227 0.045 [0.137,0.317] 0.241 0.077 [0.087,0.395] 
what follows the true values $\theta_i$'s refer to the baseball players' actual batting
averages for the remainder of the season. Also, $\hat{\theta}_{i,HB}$ and $\hat{\theta}_{i,M}$ denote respectively
the HB estimate of $\theta_i$ and Morris's approximate estimate of $\theta_i$. The standard
errors associated with $\hat{\theta}_{i,HB}$ and $\hat{\theta}_{i,M}$ are denoted respectively by $s_{i,HB}$ and $s_{i,M}$.
It turns out that

$$(18\sigma^2)^{-1}\sum_{i=1}^{18}(X_i - \theta_i)^2 = 0.976,$$

$$(18\sigma^2)^{-1}\sum_{i=1}^{18}(\hat{\theta}_{i,HB} - \theta_i)^2 = 0.299,$$

and

$$(18\sigma^2)^{-1}\sum_{i=1}^{18}(\hat{\theta}_{i,M} - \theta_i)^2 = 0.286,$$

so that Morris's approximations serve well as point estimates. However, Morris's
(1981) approximations to the s.d.'s are consistently larger than the actual ones,
leading thereby to wider confidence intervals. It appears that Morris (1981) has
reported the $\hat{\theta}_{i,HB}$'s and $s_{i,HB}$'s in his Table 1, p. 31, but his notations seem to
suggest that these are the $\hat{\theta}_{i,M}$'s and $s_{i,M}$'s.

So far we have considered only the case when the sampling variance $\sigma^2$ is
known. In a more realistic setup, $\sigma^2$ is unknown. In such instances, one
approach is to first find the Bayes estimator of $\theta$ assuming $\sigma^2$ to be known. Next
find an estimator of $\sigma^2$, and substitute this estimator in the Bayes estimator
found earlier. Berger (1985) discusses this approach. A slightly different classical
EB approach can be found in Ghosh and Meeden (1986) or Ghosh and Lahiri
(1987). These methods do not take into account the uncertainty involved in
estimating $\sigma^2$. This deficiency can be rectified by putting a prior distribution
(often non-informative) on $\sigma^2$ as well.

One important example is the unbalanced one-way ANOVA model. We
propose a HB analysis with an unknown $\sigma^2$ as well as unknown parameters
involved in the prior distribution of $\theta$. We find it convenient to reparametrize
into $\sigma^2 = r^{-1}$ and $A = (\lambda r)^{-1}$. The remainder of this section is an adaptation of
the arguments of Ghosh and Lahiri (1988).


Assume that

(a) conditional on $\theta$, $m$, $\lambda$ and $r$, the random variables $X_1,\ldots,X_p$ and $U$
are mutually independent with $X_i \sim N(\theta_i, (rn_i)^{-1})$
($i = 1,\ldots,p$), while $U \sim r^{-1}\chi^2_{N-p}$ ($N = \sum_{i=1}^p n_i$);

(b) conditional on $m$, $\lambda$ and $r$, $\theta \sim N(m1_p, (\lambda r)^{-1}I_p)$;

(c) marginally, $M$, $\Lambda$ and $R$ are independently distributed with $M \sim$
uniform$(-\infty, \infty)$, $R$ has pdf $g(r) \propto r^{-2}$, while $\Lambda$ has pdf $h(\lambda) \propto \lambda^{-2}$.

Remark 3. Note that we have changed the notation from $\mu$ to $m$. If one assigns
the noninformative prior $g(A, \sigma^2) \propto (\sigma^2)^{-1}$, then noting that $r = (\sigma^2)^{-1}$ and $\lambda r
= A^{-1}$, one gets the prior on $R$ and $\Lambda$ as given in (c). It is possible to assign
gamma priors (informative or noninformative) on $R$ and $\Lambda R$ as in Ghosh and
Lahiri (1988), but we have decided to sacrifice that generality.

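To identify the above model with an unbalanced one-way random effects
ANOVA model, write $Y_{ij} = m + \tau_i + e_{ij}$ ($j = 1,\ldots,n_i$; $i = 1,\ldots,p$). Here, the $\tau_i$'s and
$e_{ij}$'s are mutually independent with the $\tau_i$'s iid $N(0, (\lambda r)^{-1})$ and the $e_{ij}$'s iid $N(0, r^{-1})$.
Write $\theta_i = m + \tau_i$, $X_i = \bar{Y}_i = n_i^{-1}\sum_{j=1}^{n_i} Y_{ij}$ ($i = 1,\ldots,p$) and
$U = \sum_{i=1}^p \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2$. Clearly, $(X_1,\ldots,X_p, U)$ is minimal sufficient with joint
distribution given in (a).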
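The reduction to $(X_1,\ldots,X_p, U)$ can be sketched in a few lines. The function name and the representation of the data as a list of groups are our own conventions.

```python
def sufficient_statistics(groups):
    """Return (n, x, u) for unbalanced one-way data.

    groups[i] holds the observations Y_i1, ..., Y_in_i of group i;
    x[i] is the group mean and u is the pooled within-group sum of squares.
    """
    n = [len(g) for g in groups]
    x = [sum(g) / len(g) for g in groups]
    u = sum((y - xi) ** 2 for g, xi in zip(groups, x) for y in g)
    return n, x, u
```

For example, the groups `[1, 3]`, `[2, 2, 2]` and `[0, 4, 8, 4]` give group sizes (2, 3, 4), group means (2, 2, 4) and a within-group sum of squares of 34.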

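Under the above model, the joint pdf of $X_1,\ldots,X_p$, $U$, $\theta$, $M$, $R$ and $\Lambda$ is
given by

$$f(x, u, \theta, m, r, \lambda) \propto \exp\Big[-\tfrac12 r\sum_{i=1}^p n_i(x_i-\theta_i)^2\Big] r^{\frac12 p}$$
$$\times \exp(-\tfrac12 ru)\, u^{\frac12(N-p)-1} r^{\frac12(N-p)}$$
$$\times \exp\Big[-\tfrac12 \lambda r\sum_{i=1}^p (\theta_i - m)^2\Big] (\lambda r)^{\frac12 p - 2}. \qquad (48)$$

Integrating with respect to $m$ in (48), one gets the joint pdf of $X$, $U$, $\theta$, $R$ and $\Lambda$
given by

$$f(x, u, \theta, r, \lambda) \propto r^{\frac12(N+p-1)-2} \exp\big[-\tfrac12 r(\theta^T D\theta - 2\theta^T Gx + x^T Gx + u)\big]$$
$$\times\, u^{\frac12(N-p)-1} \lambda^{\frac12(p-1)-2}, \qquad (49)$$

where $G = \mathrm{Diag}(n_1,\ldots,n_p)$, $D = G + \lambda(I_p - p^{-1}J_p)$. Next integrating
with respect to $r$ in (49), it follows that the joint pdf of $X$, $U$, $\theta$ and $\Lambda$ is

$$f(x, u, \theta, \lambda) \propto u^{\frac12(N-p)-1} \lambda^{\frac12(p-1)-2} (\theta^T D\theta - 2\theta^T Gx + x^T Gx + u)^{-\frac12(N+p-3)}$$
$$= u^{\frac12(N-p)-1} \lambda^{\frac12(p-1)-2} \big[(\theta - D^{-1}Gx)^T D(\theta - D^{-1}Gx) + x^T(G - GD^{-1}G)x + u\big]^{-\frac12(N+p-3)}. \qquad (50)$$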


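It is clear from (50) that the conditional pdf of $\theta$ given $x$, $u$ and $\lambda$ is
multivariate-t with location parameter $D^{-1}Gx$, scale parameter
$(N-1)^{-1}[x^T(G - GD^{-1}G)x + u]D^{-1}$ and degrees of freedom $N-1$. On
simplification, one gets

$$D^{-1} = K + \lambda\Big\{\sum_{i=1}^p n_i(n_i+\lambda)^{-1}\Big\}^{-1} KJ_pK \quad (K = \mathrm{Diag}[(n_1+\lambda)^{-1},\ldots,(n_p+\lambda)^{-1}]); \qquad (51)$$

$$D^{-1}G = \mathrm{Diag}[n_1(n_1+\lambda)^{-1},\ldots,n_p(n_p+\lambda)^{-1}]$$
$$+\ \lambda\Big\{\sum_{i=1}^p n_i(n_i+\lambda)^{-1}\Big\}^{-1} \big((n_1+\lambda)^{-1},\ldots,(n_p+\lambda)^{-1}\big)^T \big(n_1(n_1+\lambda)^{-1},\ldots,n_p(n_p+\lambda)^{-1}\big); \qquad (52)$$

$$D^{-1}Gx = \big[n_1(n_1+\lambda)^{-1}x_1 + \lambda(n_1+\lambda)^{-1}\tilde{x}_\lambda,\ \ldots,\ n_p(n_p+\lambda)^{-1}x_p + \lambda(n_p+\lambda)^{-1}\tilde{x}_\lambda\big]^T, \qquad (53)$$

where $\tilde{x}_\lambda = \big\{\sum_{i=1}^p n_i(n_i+\lambda)^{-1}x_i\big\}\big\{\sum_{i=1}^p n_i(n_i+\lambda)^{-1}\big\}^{-1}$. Further, after much
simplification, one can write

$$x^T(G - GD^{-1}G)x = \lambda\Big[\sum_{i=1}^p n_i(n_i+\lambda)^{-1}x_i^2 - \Big\{\sum_{i=1}^p n_i(n_i+\lambda)^{-1}\Big\}^{-1}\Big\{\sum_{i=1}^p n_i(n_i+\lambda)^{-1}x_i\Big\}^2\Big]$$
$$= Q_\lambda(x) \text{ (say)}. \qquad (54)$$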
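For a fixed $\lambda$, the quantities in (53) and (54) involve only weighted sums, so they can be computed without forming $D^{-1}$. The following sketch uses our own function name and writes $w_i = n_i/(n_i+\lambda)$, so that $\lambda/(n_i+\lambda) = 1 - w_i$.

```python
def shrink_given_lambda(x, n, lam):
    """Conditional posterior mean (53) and quadratic form Q_lambda(x) of (54).

    x: group means, n: group sizes, lam: the ratio lambda (lam >= 0).
    Returns (estimates, x_tilde, q).
    """
    w = [ni / (ni + lam) for ni in n]  # w_i = n_i / (n_i + lambda)
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    x_tilde = swx / sw  # weighted mean towards which each x_i is shrunk
    estimates = [wi * xi + (1.0 - wi) * x_tilde for wi, xi in zip(w, x)]
    q = lam * (sum(wi * xi * xi for wi, xi in zip(w, x)) - swx * swx / sw)
    return estimates, x_tilde, q
```

Groups with small $n_i$ receive small weights $w_i$ and are therefore pulled more strongly towards $\tilde{x}_\lambda$. At $\lambda = 0$ there is no shrinkage at all.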


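Integrating with respect to $\theta$ in (50), one finds the joint pdf of $X$, $U$ and $\Lambda$ given
by

$$f(x, u, \lambda) \propto u^{\frac12(N-p)-1} \lambda^{\frac12(p-1)-2} |D|^{-\frac12} [Q_\lambda(x) + u]^{-\frac12(N-3)}. \qquad (55)$$

Using $|D| \propto \big\{\prod_{i=1}^p (n_i+\lambda)\big\}\big\{\sum_{i=1}^p n_i(n_i+\lambda)^{-1}\big\}$, it follows from (55) that the
conditional pdf of $\Lambda$ given $x$ and $u$ is

$$f(\lambda|x, u) \propto \lambda^{\frac12(p-1)-2} \Big\{\prod_{i=1}^p (n_i+\lambda)\Big\}^{-\frac12} \Big\{\sum_{i=1}^p n_i(n_i+\lambda)^{-1}\Big\}^{-\frac12} [Q_\lambda(x) + u]^{-\frac12(N-3)}. \qquad (56)$$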


From the properties of multivariate-t, it follows that $E(\theta|x, u, \lambda) = D^{-1}Gx$, given
in (53), and $V(\theta|x, u, \lambda) = (N-3)^{-1}[Q_\lambda(x) + u]D^{-1}$. One obtains now $E(\theta|x, u)$
and $V(\theta|x, u)$ by using (56) and the formulas

$$E(\theta|x, u) = E[E(\theta|x, u, \Lambda)|x, u];$$
$$V(\theta|x, u) = V[E(\theta|x, u, \Lambda)|x, u] + E[V(\theta|x, u, \Lambda)|x, u]. \qquad (57)$$

As noted already, the posterior mean of $\theta$ is given by (53) for known $\lambda$.
Ghosh and Meeden (1986) used a classical EB procedure to estimate $\lambda$ and used
this estimator of $\lambda$ in (53) to obtain an estimator of $\theta$. Although the resulting
estimator of $\theta$ was quite satisfactory for point estimation purposes (see Ghosh
and Lahiri, 1987), the method suffered from the earlier criticism of not modelling
the uncertainty in $\lambda$. The Ghosh-Meeden procedure was not particularly suitable
for the construction of credible intervals or sets.


Shrinking Towards Regression Surfaces 


In the preceding section, the sample mean was either shrunk towards a
specified point or a subspace spanned by the vector $1_p$. The present section
generalizes the ideas of the preceding section by shrinking the sample mean
towards an arbitrary regression surface. This can be achieved by using either an
EB or a HB approach. The HB approach is discussed in detail in Lindley and
Smith (1972) with known variance components. Morris (1983) provides a
thorough discussion of the EB procedure. We attempt a synthesis between the
two, and argue that Morris's EB procedure is indeed an attempt to approximate
a bona fide HB procedure, and is clearly superior to a naive EB procedure.

We begin with Morris's set up, except that we assign distributions on the
unknown hyperparameters, rather than estimate them on the basis of the
marginal distributions of the observations. The following model is proposed.

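(A) Conditional on $\theta$, $b$ and $a$, let $X_1,\ldots,X_p$ be independently distributed
with $X_i \sim N(\theta_i, V_i)$, $i = 1,\ldots,p$, where the $V_i$'s are known positive
constants;

(B) Conditional on $b$ and $a$, $\Theta_1,\ldots,\Theta_p$ are independently distributed with
$\Theta_i \sim N(z_i^T b, a)$ ($i = 1,\ldots,p$), where $z_1,\ldots,z_p$ are known regression
vectors of dimension $r$ and $b$ is $r \times 1$;

(C) $B$ and $A$ are marginally independent with $B \sim$ uniform$(R^r)$ and $A
\sim$ uniform$(0, \infty)$. We assume that $p \ge r+3$. Also, we write $Z^T =
(z_1,\ldots,z_p)$; $G = \mathrm{Diag}(V_1,\ldots,V_p)$ and assume rank$(Z) = r$.

Now the joint (improper) pdf of $X = (X_1,\ldots,X_p)^T$, $\Theta = (\Theta_1,\ldots,\Theta_p)^T$, $B$
and $A$ is given by

$$f(x, \theta, b, a) \propto \exp[-\tfrac12(x-\theta)^T G^{-1}(x-\theta)]\, a^{-\frac12 p} \exp\Big[-\frac{1}{2a}\|\theta - Zb\|^2\Big]. \qquad (58)$$

Integrating with respect to $b$ in (58), one finds the joint (improper) pdf of $X$, $\Theta$
and $A$ given by

$$f(x, \theta, a) \propto a^{-\frac12(p-r)} \exp\Big[-\tfrac12(x-\theta)^T G^{-1}(x-\theta) - \frac{1}{2a}\theta^T(I_p - Z(Z^TZ)^{-1}Z^T)\theta\Big]. \qquad (59)$$

Write $E^{-1} = G^{-1} + a^{-1}(I_p - Z(Z^TZ)^{-1}Z^T)$. Then, one can write

$$(x-\theta)^T G^{-1}(x-\theta) + a^{-1}\theta^T(I_p - Z(Z^TZ)^{-1}Z^T)\theta$$
$$= \theta^T E^{-1}\theta - 2\theta^T G^{-1}x + x^T G^{-1}x$$
$$= (\theta - EG^{-1}x)^T E^{-1}(\theta - EG^{-1}x) + x^T(G^{-1} - G^{-1}EG^{-1})x. \qquad (60)$$

From (59) and (60) it follows that

$$E(\Theta|x, a) = EG^{-1}x; \quad V(\Theta|x, a) = E. \qquad (61)$$

Write $u_i = V_i/(a+V_i)$ ($i = 1,\ldots,p$), and $D = \mathrm{Diag}(1-u_1,\ldots,1-u_p)$. Then, on
simplification, it follows that

$$E = a(I_p - D) + (I_p - D)Z(Z^TDZ)^{-1}Z^T a(I_p - D); \qquad (62)$$
$$EG^{-1} = D + (I_p - D)Z(Z^TDZ)^{-1}Z^TD; \qquad (63)$$
$$EG^{-1}x = [(1-u_1)x_1 + u_1z_1^T\tilde{b},\ \ldots,\ (1-u_p)x_p + u_pz_p^T\tilde{b}]^T, \qquad (64)$$

where $\tilde{b} = (Z^TDZ)^{-1}(Z^TDx)$. Then,

$$G^{-1} - G^{-1}EG^{-1} = a^{-1}[D - DZ(Z^TDZ)^{-1}Z^TD]. \qquad (65)$$

Hence,

$$x^T(G^{-1} - G^{-1}EG^{-1})x = a^{-1}\Big[\sum_{i=1}^p (1-u_i)x_i^2 - \Big(\sum_{i=1}^p (1-u_i)x_iz_i\Big)^T (Z^TDZ)^{-1} \Big(\sum_{i=1}^p (1-u_i)x_iz_i\Big)\Big]$$
$$= Q_a(x) \text{ (say)}. \qquad (66)$$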
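Combining (59), (60) and (66), the joint pdf of $X$ and $A$ is given by

$$f(x, a) \propto |E|^{\frac12} a^{-\frac12(p-r)} \exp[-\tfrac12 Q_a(x)]. \qquad (67)$$

Writing $F = G^{-1} + a^{-1}I_p$ and using Exercise 2.4, p. 32 of Rao (1973), one gets

$$|E^{-1}| = \begin{vmatrix} F & Z \\ Z^T & a(Z^TZ) \end{vmatrix} \div |a(Z^TZ)|$$
$$= |F|\,|a(Z^TZ) - Z^TF^{-1}Z| \div |a(Z^TZ)|$$
$$\propto a^{-p}\Big\{\prod_{i=1}^p (a+V_i)\Big\}|Z^TDZ|. \qquad (68)$$

It is clear from (67) and (68) that

$$f(a|x) \propto a^{\frac12 r}\Big\{\prod_{i=1}^p (a+V_i)\Big\}^{-\frac12} |Z^TDZ|^{-\frac12} \exp[-\tfrac12 Q_a(x)]. \qquad (69)$$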


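Now writing $U_i = V_i/(A + V_i)$ ($i = 1,\ldots,p$), using (69), and the iterated
formulas for conditional expectations and variances, one gets

$$E[\Theta_i|x] = E[E(\Theta_i|x, A)|x] = E[(1-U_i)x_i + U_iz_i^T\tilde{b}\,|\,x]; \qquad (70)$$

$$V[\Theta_i|x] = V[E(\Theta_i|x, A)|x] + E[V(\Theta_i|x, A)|x]$$
$$= V[(1-U_i)x_i + U_iz_i^T\tilde{b}\,|\,x] + E[AU_i + AU_i^2z_i^T(Z^TDZ)^{-1}z_i\,|\,x]$$
$$= V[U_i(x_i - z_i^T\tilde{b})\,|\,x] + E[V_i(1-U_i) + V_iU_i(1-U_i)z_i^T(Z^TDZ)^{-1}z_i\,|\,x]; \qquad (71)$$

$$\mathrm{Cov}[\Theta_i, \Theta_j|x] = \mathrm{Cov}[U_i(x_i - z_i^T\tilde{b}),\ U_j(x_j - z_j^T\tilde{b})\,|\,x] + E[AU_iU_jz_i^T(Z^TDZ)^{-1}z_j\,|\,x]. \qquad (72)$$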


Morris (1983) provides approximations for $E(\Theta_i|x)$ and $V(\Theta_i|x)$, $i = 1,\ldots,p$. He
estimates the parameter $a$ from the marginal distribution of $X_1,\ldots,X_p$ by
employing some non-Bayesian method, and substitutes this estimate in the
expressions for $E[\Theta_i|x, a]$ and $V[\Theta_i|x, a]$ instead of finding posterior expectations
and variances of functions involving $A$. Thus, using Morris's method, $E[\Theta_i|x]$ is
approximated by $(1-\hat{u}_i)x_i + \hat{u}_iz_i^T\hat{b} = x_i - \hat{u}_i(x_i - z_i^T\hat{b})$, while $V(\Theta_i|x)$ is
approximated by $v_i(x_i - z_i^T\hat{b})^2 + V_i(1-\hat{u}_i)(1 + \hat{u}_iz_i^T(Z^T\hat{D}Z)^{-1}z_i)$. In
the above $v_i = [2/(p-r-2)]\hat{u}_i^2(\bar{V}+\hat{a}) \div (V_i+\hat{a})$, $i = 1,\ldots,p$,
$\bar{V} = \sum_{i=1}^p V_i(V_i+\hat{a})^{-1} \div \sum_{i=1}^p (V_i+\hat{a})^{-1}$, $\hat{D} = \mathrm{Diag}(1-\hat{u}_1,\ldots,1-\hat{u}_p)$, and $\hat{b}$ is obtained from $\tilde{b}$ by
substituting the estimator of $a$. The $v_i$'s are purported to estimate the $V(U_i|x)$'s. It
is not clear whether such an approximation can be justified very rigorously since
$\hat{b}$ also involves the $\hat{u}_i$'s and $\hat{u}_i$ is not distributed independently of the $x_i - z_i^T\hat{b}$.


We examine now how formulas (70) and (71) work in estimating the
batting averages of Ty Cobb during 1905-1928. Morris (1983) took a similar
undertaking except that his major emphasis was to examine whether Ty Cobb
was “ever a true .400 hitter”. To make our results comparable to those of Morris
(1983), we fit a quadratic to Ty Cobb’s batting averages, that is we take $b =
(b_1, b_2, b_3)^T$, $z_i = (1, i, i^2)^T$, $i = 1,\ldots,24$. In the above, year 1 refers to 1905,
and year 24 refers to 1928. We provide in Table 2 the actual batting averages
TABLE 2. The Actual Batting Averages of Ty Cobb ($Y_i$), the Number of
Times He Was at Bat ($n_i$), the HB Estimates ($\hat{\theta}_{i,HB}$), the Corresponding
S.D.'s ($s_{i,HB}$), Morris's Approximate Estimates ($\hat{\theta}_{i,M}$), and the Corresponding
S.D.'s ($s_{i,M}$).

$i$  $n_i$  $Y_i$  $\hat{\theta}_{i,HB}$  $s_{i,HB}$  $[\hat{\theta}_{i,HB} \pm 2s_{i,HB}]$  $\hat{\theta}_{i,M}$  $s_{i,M}$  $[\hat{\theta}_{i,M} \pm 2s_{i,M}]$

1 150 .240 .298 .020 [.258, .338] .303 .026 [.251, .355]
2 350 .320 .325 .015 [.295, .355] .327 .018 [.293, .363]
3 605 .350 .344 .013 [.318, .370] .345 .015 [.315, .375]
4 581 .324 .337 .014 [.309, .365] .339 .015 [.309, .369]
5 573 .377 .366 .014 [.338, .394] .366 .015 [.336, .396]
6 509 .385 .373 .014 [.345, .401] .373 .015 [.343, .403]
7 591 .420 .393 .016 [.361, .425] .393 .017 [.359, .427]
8 553 .410 .391 .015 [.361, .421] .391 .015 [.361, .421]
9 428 .390 .384 .015 [.354, .414] .385 .015 [.355, .415]
10 345 .368 .379 .015 [.349, .409] .379 .016 [.347, .411]
11 563 .369 .379 .014 [.351, .407] .380 .014 [.352, .408]
12 542 .371 .381 .014 [.353, .409] .381 .015 [.351, .411]
13 588 .383 .386 .013 [.350, .412] .386 .014 [.358, .414]
14 421 .382 .385 .015 [.355, .415] .385 .015 [.355, .415]
15 497 .384 .385 .014 [.357, .413] .385 .014 [.357, .413]
16 428 .334 .364 .016 [.332, .396] .365 .018 [.329, .400]
17 507 .389 .383 .014 [.355, .411] .383 .014 [.355, .411]
18 526 .401 .385 .015 [.355, .415] .384 .015 [.354, .414]
19 556 .340 .355 .014 [.327, .383] .356 .015 [.326, .386]
20 625 .338 .350 .014 [.322, .378] .351 .014 [.323, .379]
21 415 .378 .362 .015 [.332, .392] .361 .016 [.329, .393]
22 233 .339 .342 .016 [.310, .374] .342 .018 [.306, .378]
23 490 .357 .342 .015 [.312, .372] .342 .017 [.308, .376]
24 353 .323 .322 .015 [.290, .352] .322 .019 [.284, .360]
($Y_i$) of Ty Cobb, the number of times he was at bat ($n_i$), our estimated batting
averages ($\hat{\theta}_{i,HB}$), the corresponding standard errors ($s_{i,HB}$), Morris's approximations
($\hat{\theta}_{i,M}$) for these batting averages, and the corresponding approximate
standard errors ($s_{i,M}$). Following Morris, we took $V_i = (.367)(.633)/n_i$, $i =
1,\ldots,24$.

It follows from Table 2 that $\sum_{i=1}^{24}(\hat{\theta}_{i,HB} - Y_i)^2 = .007377$ and
$\sum_{i=1}^{24}(\hat{\theta}_{i,M} - Y_i)^2 = .008244$. Thus, Morris's approximations lead to about an
11.0% increase in the overall mean squared error. Also, the $s_{i,M}$'s, though mostly
very close to the $s_{i,HB}$'s, can lead to an increase of up to 30%. More importantly, our two
standard deviation confidence intervals around the posterior means are usually
much tighter than the corresponding ones given in Morris (1983). However, as
mentioned earlier, Morris's EB procedure is much superior to a naive EB
procedure, since the latter can seriously underestimate the actual standard errors.
This is evidenced in our actual calculations which are not reported here. We
should also point out that both the $[\hat{\theta}_{i,HB} \pm 2s_{i,HB}]$'s and the $[\hat{\theta}_{i,M} \pm 2s_{i,M}]$'s cover the
true $Y_i$'s 23 out of 24 times, which is approximately 95.8%. Also, the
$[\hat{\theta}_{i,HB} \pm s_{i,HB}]$'s and the $[\hat{\theta}_{i,M} \pm s_{i,M}]$'s cover the true $Y_i$'s 17 out of 24 times, which
is approximately 70.8%. Thus a normal approximation to the posterior distribution
is not totally out of the way.

One of Cobb’s greatest claims to fame is that he has the highest lifetime
batting average of any baseball player in the modern era. Ty Cobb’s actual
overall batting average in 1905-1928 is .367. Also, $\hat{\theta}_{HB} = \sum_i n_i\hat{\theta}_{i,HB} / \sum_i n_i
= .366$ and $\hat{\theta}_M = \sum_i n_i\hat{\theta}_{i,M} / \sum_i n_i = .366$. This shows that both the HB
and EB estimates of the overall batting average of Ty Cobb essentially match the
reality.


It is instructive to look at the special case of equal variances, that is,
when $V_1 = \cdots = V_p = V$. Then $u_1 = \cdots = u_p = V/(V+a) = u$ (say). In this
case $D = (1-u)I_p$, $Z^TDZ = (1-u)Z^TZ$, $\tilde{b} = (Z^TZ)^{-1}Z^Tx = \hat{b}$, the usual least
squares estimate of $b$. Moreover, $a + V = Vu^{-1}$ so that $a = V(1-u)/u$, $Q_a(x) =
a^{-1}(1-u)SSE$, where $SSE = \sum_{i=1}^p x_i^2 - (\sum_{i=1}^p x_iz_i)^T(Z^TZ)^{-1}(\sum_{i=1}^p x_iz_i)$, the usual
error SS. Since $|da/du| = Vu^{-2}$, it follows from (69) that the conditional pdf of
$U$ given $x$ is

$$f(u|x) \propto ((1-u)/u)^{\frac12 r}\, u^{\frac12 p - 2} (1-u)^{-\frac12 r} \exp\Big(-\frac{u}{2V}SSE\Big)$$
$$= u^{\frac12(p-r-4)} \exp\Big(-\frac{u}{2V}SSE\Big). \qquad (73)$$
It follows from (70) and (71) that

$$E(\Theta_i|x) = x_i - E(U|x)(x_i - z_i^T\hat{b}); \qquad (74)$$

$$V(\Theta_i|x) = V(U|x)(x_i - z_i^T\hat{b})^2 + V - V E(U|x)(1 - z_i^T(Z^TZ)^{-1}z_i). \qquad (75)$$

If one adopts Morris's approximations as in the second section, then one
estimates $E(U|x)$ by

$$\hat{U} = \int_0^\infty u^{\frac12(p-r-2)} \exp\Big(-\frac{u}{2V}SSE\Big) du \Big/ \int_0^\infty u^{\frac12(p-r-4)} \exp\Big(-\frac{u}{2V}SSE\Big) du$$
$$= V(p-r-2)/SSE$$

and $E(U^2|x)$ by

$$\int_0^\infty u^{\frac12(p-r)} \exp\Big(-\frac{u}{2V}SSE\Big) du \Big/ \int_0^\infty u^{\frac12(p-r-4)} \exp\Big(-\frac{u}{2V}SSE\Big) du$$
$$= V^2(p-r)(p-r-2)/(SSE)^2.$$

Accordingly, $V(U|x)$ is approximated by $2V^2(p-r-2) \div (SSE)^2 = [2/(p-r-2)]\hat{U}^2$.
These calculations suggest that $E(\Theta_i|x)$ should be approximated by
$x_i - \hat{U}(x_i - z_i^T\hat{b})$ and $V(\Theta_i|x)$ should be approximated by

$$[2/(p-r-2)]\hat{U}^2(x_i - z_i^T\hat{b})^2 + V[1 - \hat{U}(1 - z_i^T(Z^TZ)^{-1}z_i)]. \qquad (76)$$
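The equal-variance approximations leading to (76) can be put together in a short sketch. The code below is an illustration under our own conventions: rows of `z` are the regression vectors $z_i$, the linear system is solved by plain Gauss-Jordan elimination, and $\hat{U}$ is not truncated at 1.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [list(row) + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda k: abs(m[k][c]))
        m[c], m[piv] = m[piv], m[c]
        for k in range(n):
            if k != c:
                f = m[k][c] / m[c][c]
                m[k] = [vk - f * vc for vk, vc in zip(m[k], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def morris_equal_variance(x, z, v):
    """Return (U_hat, [(estimate_i, variance_i), ...]) following (76)."""
    p, r = len(x), len(z[0])
    ztz = [[sum(zi[j] * zi[k] for zi in z) for k in range(r)] for j in range(r)]
    ztx = [sum(zi[j] * xi for zi, xi in zip(z, x)) for j in range(r)]
    b_hat = solve(ztz, ztx)  # least squares estimate of b
    resid = [xi - sum(zj * bj for zj, bj in zip(zi, b_hat)) for zi, xi in zip(z, x)]
    sse = sum(e * e for e in resid)
    u_hat = v * (p - r - 2) / sse
    out = []
    for zi, xi, e in zip(z, x, resid):
        lev = sum(zj * cj for zj, cj in zip(zi, solve(ztz, list(zi))))
        est = xi - u_hat * e
        var = 2.0 / (p - r - 2) * u_hat ** 2 * e ** 2 + v * (1.0 - u_hat * (1.0 - lev))
        out.append((est, var))
    return u_hat, out
```

With an intercept-only design ($r = 1$, $z_i = 1$) the point estimates reduce to Lindley's estimator (31), since $p - r - 2 = p - 3$ and the fitted value is $\bar{x}$.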


The expression in (76) does not agree with the expression $s_i^2$ given in (4.1) of Morris
(1983) (with the obvious changes in his notations). It seems to us that Morris's
(4.1) uses his (1.17) which involves a slight oversight. We shall discuss this point
now.

Morris (1983) starts with an EB approach, where he assumes conditions
(A) and (B) with $V_1 = \cdots = V_p = V$ (say). With this formulation, for known $b$ and
$a$, the Bayes estimator of $\theta$ is given by

$$\hat{\theta}_B = (1-u)X + uZb, \quad u = V/(V+a). \qquad (77)$$

If $b$ and $u$ are unknown, Morris (1983) estimates them by $\hat{b}$ and $\hat{u}$ respectively,
where $\hat{b} = (Z^TZ)^{-1}Z^TX$, the least squares estimator of $b$, and $\hat{u} = (p-r-2)V/SSE$,
$SSE = \sum_{i=1}^p (X_i - z_i^T\hat{b})^2$, the error SS. Note that $\hat{u}$ is the UMVUE of $u$ since
marginally $SSE \sim Vu^{-1}\chi^2_{p-r}$.

Morris (1983) proposes the EB estimator $\hat{\theta}_{EB} = (\hat{\theta}_{1,EB},\ldots,\hat{\theta}_{p,EB})^T$ of $\theta$,
where

$$\hat{\theta}_{i,EB} = (1-\hat{u})X_i + \hat{u}z_i^T\hat{b} \quad (i = 1,\ldots,p). \qquad (78)$$

Then,

$$E(\hat{\theta}_{i,EB} - \theta_i)^2 = E(\theta_i - \hat{\theta}_{i,B})^2 + E(\hat{\theta}_{i,B} - \hat{\theta}_{i,EB})^2$$
$$= V(1-u) + E[(\hat{u}-u)(X_i - z_i^T\hat{b}) + uz_i^T(b - \hat{b})]^2. \qquad (79)$$

Using the marginal independence of $X_i - z_i^T\hat{b}$ and $\hat{b}$, and noting that $V(\hat{b}) =
Vu^{-1}(Z^TZ)^{-1}$, it follows from (79) that

$$E(\hat{\theta}_{i,EB} - \theta_i)^2 = V(1-u) + E[(\hat{u}-u)^2(X_i - z_i^T\hat{b})^2] + Vuz_i^T(Z^TZ)^{-1}z_i. \qquad (80)$$

Since $(\hat{b}, SSE)$ is complete sufficient for $(b, u)$ and $(X_i - z_i^T\hat{b})^2/SSE$ is ancillary,
they are independently distributed by Basu's (1955) theorem. Now using
$E(X_i - z_i^T\hat{b})^2 = Vu^{-1}(1 - z_i^T(Z^TZ)^{-1}z_i)$ and $SSE \sim Vu^{-1}\chi^2_{p-r}$, it follows on
simplification that

$$E[(\hat{u}-u)^2(X_i - z_i^T\hat{b})^2] = Vu(1 - z_i^T(Z^TZ)^{-1}z_i) - Vu\frac{p-r-2}{p-r}(1 - z_i^T(Z^TZ)^{-1}z_i). \qquad (81)$$

Combining (80) and (81), it follows that

$$E(\hat{\theta}_{i,EB} - \theta_i)^2 = V - V\frac{p-r-2}{p-r}(1 - z_i^T(Z^TZ)^{-1}z_i)u. \qquad (82)$$

In Morris's (1.17), $z_i^T(Z^TZ)^{-1}z_i = r/p$ for every $i$, which does not seem to be the
case.
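The point about $z_i^T(Z^TZ)^{-1}z_i$ can be checked numerically for the quadratic design used for the Ty Cobb data. The sketch below uses our own function names. It computes these leverages for $z_i = (1, i, i^2)^T$, $i = 1,\ldots,24$, together with the risk expression (82).

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [list(row) + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda k: abs(m[k][c]))
        m[c], m[piv] = m[piv], m[c]
        for k in range(n):
            if k != c:
                f = m[k][c] / m[c][c]
                m[k] = [vk - f * vc for vk, vc in zip(m[k], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def leverages(z):
    """Return z_i^T (Z^T Z)^{-1} z_i for every row z_i of the design."""
    r = len(z[0])
    ztz = [[sum(zi[j] * zi[k] for zi in z) for k in range(r)] for j in range(r)]
    return [sum(zj * cj for zj, cj in zip(zi, solve(ztz, list(zi)))) for zi in z]


def eb_risk(v, u, p, r, h_i):
    """Risk (82) of the i-th EB estimate, where h_i is its leverage."""
    return v - v * (p - r - 2) / (p - r) * (1.0 - h_i) * u


z = [(1.0, float(i), float(i * i)) for i in range(1, 25)]
h = leverages(z)
```

The leverages always add up to $r$, so $r/p$ is their average. For this design they are larger at the two ends of the period than in the middle, so by (82) the risk is not the same for every year.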
BASU’S CONTRIBUTIONS TO THE FOUNDATIONS OF
SAMPLE SURVEY 


Glen Meeden, Department of Statistics, Iowa State 
University, Ames 


Introduction 


Whenever I read a paper by Dev I am impressed with the clarity of his
writing and thinking. He is able to distill the essence of the topic at hand and
present it in such a way that it seems almost obvious to me. This is particularly
true in the foundations of sample survey where he has elegantly demonstrated the
proper role of the sufficiency and likelihood principles. Because these principles
fail to justify much of the current design based practice and because he has
presented his arguments in a Bayesian context, some survey samplers have chosen
either to ignore these principles or to attempt to modify their consequences. This
coldness to Bayesian ideas in survey sampling could be considered surprising since
it is the one area in statistics where everyone agrees prior information should be
used.

In the next section, the results of Basu and Ghosh (1967), which 
characterize the minimal sufficiency partition for discrete models, will be briefly 
summarized. In the third section, the results of Basu (1969) will be summarized. 
Here he demonstrated the role of the sufficiency and likelihood principles in 
sample survey, from which it follows, that once the sample has been drawn the 
inference should not depend in any way on the sampling design. In the fourth 
section, some of the implications of these results will be noted. In particular, the 
famous Jumbo example of Basu (1971) will be discussed. It will be shown how 
Basu’s argument there suggests a pseudo-Bayesian approach to survey sampling. 
This approach is quite flexible in that one can incorporate various levels of prior 
information without specifying a prior distribution. Finally, the role of random 
sampling in survey sampling will be discussed briefly. It should be noted that 
Basu (1978) contains some further reflections on his earlier work. 


Sufficiency in Discrete Models 


For many years, in statistical decision theory, it has been an accepted
convention to begin by assuming the existence of a nonempty set $X$, equipped
with a $\sigma$-algebra of subsets of $X$, say $\mathcal{B}$, along with $P = \{P_\theta \mid \theta \in \Omega\}$, a family of
probability measures on $(X, \mathcal{B})$. One of the consequences of Basu's work (along
with others) was to fit survey sampling into this scheme. For such a model it is
of interest to find the minimal sufficient statistic, assuming it exists. Now, in
general, for such models a minimal sufficient statistic need not exist. However,
for discrete models, which include the sample survey model, a minimal sufficient
statistic always exists and is easy to find.

The triple $(X, \mathcal{B}, P)$ is said to be a discrete model if (i) $\mathcal{B}$ is the class of all
subsets of $X$ and (ii) each $P_\theta$ is a discrete probability measure. (We are also
assuming that for each $x \in X$, there exists a $\theta \in \Omega$ such that $P_\theta(\{x\}) = P_\theta(x) > 0$.)
Note that a discrete model is undominated if and only if $X$ is uncountable.

Now a statistic is just a function, $T$, defined on $X$. By our choice of $\mathcal{B}$
every function $T$ is measurable. Every statistic $T$ defines an equivalence relation
($x \sim x'$ if $T(x) = T(x')$) on the space $X$. This leads to a partition of $X$ into
equivalence classes of points. Since we need not distinguish between statistics that
induce the same partition of $X$, we may think of a statistic $T$ as a partition $\{\pi\}$
of $X$ into a family of mutually exclusive and collectively exhaustive parts $\pi$.
Using the usual measure theoretic definition of sufficiency one can prove
the following factorization theorem for discrete models:

Theorem (Basu and Ghosh, 1967).

If $(X, \mathcal{B}, P)$ is a discrete model, then a necessary and sufficient condition
for a statistic (partition) $T = \{\pi\}$ to be sufficient is that there exists a real
valued function $g$ on $X$ such that, for all $\theta \in \Omega$ and $x \in X$,

$$P_\theta(x) = g(x)P_\theta(\pi_x),$$

where $\pi_x$ is the part of the partition $\{\pi\}$ that contains $x$.

Using this theorem, it is easy to find the minimal sufficient partition for
a discrete model. For each $x \in X$ let

$$\Omega_x = \{\theta \mid P_\theta(x) > 0\}.$$

Consider the binary relation on $X$: “$x \sim x'$ if $\Omega_x = \Omega_{x'}$ and $P_\theta(x)/P_\theta(x')$ is a
constant in $\theta$ for all $\theta \in \Omega_x = \Omega_{x'}$.” This is an equivalence relation on $X$ and
defines the minimal sufficient partition.


The minimal sufficient statistic has an alternative characterization. For
each $x \in X$ let $L_x(\theta)$ be the likelihood function, i.e.

$$L_x(\theta) = P_\theta(x) \text{ for } \theta \in \Omega_x,$$
$$\phantom{L_x(\theta)} = 0 \text{ for } \theta \notin \Omega_x,$$

and let

$$\bar{L}_x(\theta) = L_x(\theta)/\sup_\theta L_x(\theta)$$

be the standardized likelihood function. Consider the mapping

$$x \to \bar{L}_x(\cdot),$$

a mapping of $X$ into a class of real-valued functions on $\Omega$. This mapping is a
minimal sufficient statistic, i.e. induces the minimal sufficient partition given
above.
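For a model with finitely many sample points and parameter values, the second characterization gives a direct way to compute the minimal sufficient partition: group the sample points by their standardized likelihood functions. The sketch below is ours. It stores a model as a dictionary mapping each $\theta$ to a dictionary of point probabilities, and it rounds the standardized likelihoods so that floating point noise does not split a part.

```python
def minimal_sufficient_partition(model, digits=12):
    """Group sample points by their standardized likelihood functions.

    model[theta][x] is P_theta(x); points missing from model[theta] have
    probability 0. Every point is assumed to have positive probability
    under at least one theta.
    """
    thetas = sorted(model)
    points = sorted({x for probs in model.values() for x in probs})
    parts = {}
    for x in points:
        lik = [model[t].get(x, 0.0) for t in thetas]
        top = max(lik)  # the sup over a finite parameter set is a max
        key = tuple(round(value / top, digits) for value in lik)
        parts.setdefault(key, []).append(x)
    return sorted(parts.values())


# Two Bernoulli trials with success probability theta in {0.2, 0.5, 0.8}.
model = {
    t: {"00": (1 - t) * (1 - t), "01": (1 - t) * t, "10": t * (1 - t), "11": t * t}
    for t in (0.2, 0.5, 0.8)
}
partition = minimal_sufficient_partition(model)
```

In this example the points `"01"` and `"10"` fall in the same part, which corresponds to the familiar reduction to the number of successes.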


The Sufficiency and Likelihood Principles 
in Survey Sampling 


The sufficiency and likelihood principles were widely used in other areas
of statistics before their role in survey sampling was properly understood. The
sufficiency principle states that if $T$ is a sufficient statistic and $T(x) = T(x')$ then
the inference about $\theta$ should be the same whether the sample is $x$ or $x'$. This
principle has gained wide acceptance. In discrete models, since the mapping
$x \to \bar{L}_x(\cdot)$ is a minimal sufficient statistic, according to the sufficiency principle
two sample points $x$ and $x'$ are equally informative if

$$\bar{L}_x(\theta) = \bar{L}_{x'}(\theta) \text{ for all } \theta.$$

Note the sufficiency principle does not say anything about the nature of the
information supplied by $x$. For this we need the likelihood principle which states
that the information supplied by $x$ is just the standardized likelihood function
$\bar{L}_x(\theta)$.

To see the implications of these principles in survey sampling we consider
a simple survey model. Let $U$ denote a finite population of $N$ units labeled 1,
2,...,$N$. Attached to unit $i$ let $y_i$ be the unknown value of some characteristic of
interest. For this problem

$$\theta = (y_1,\ldots,y_N)$$

is the unknown state of nature. $\theta$ is assumed to belong to $\Omega$, a subset of $N$-dimensional
Euclidean space, $R^N$. The statistician usually has some prior
information about $y$ and this could influence the choice of $\Omega$. Often it is assumed
that $\Omega = R^N$ but this need not be so. We will assume that, in addition,
associated with each unit $i$ is $m_i$, a possible vector of other characteristics all of
which are known to the statistician. We assume that the $m_i$'s and their possible
relationship to the $y_i$'s summarize the statistician's prior information about $y$.

A subset $s$ of $\{1, 2,\ldots,N\}$ is called a sample. Let $n(s)$ denote the number
of elements belonging to $s$. Let $S$ denote the set of all possible samples. A
(nonsequential) sampling design is a function $\Delta$ defined on $S$ such that
$\Delta(s) \in [0, 1]$ and $\sum_{s \in S} \Delta(s) = 1$. Given $\theta \in \Omega$ and $s = \{i_1,\ldots,i_{n(s)}\}$, where
$1 \le i_1 < \cdots < i_{n(s)} \le N$, let $\theta(s) = (y_{i_1},\ldots,y_{i_{n(s)}})$. Suppose we wish to estimate the
population total

$$\gamma(\theta) = \sum_{i=1}^N y_i$$

with squared error loss. Note $e(s, \theta)$ will denote an estimator of $\gamma(\theta)$ where
$e(s, \theta)$ depends on $\theta$ only through $\theta(s)$. If the design $\Delta$ is used in conjunction
with the estimator $e$, then the risk function is

$$r(\theta; \Delta, e) = \sum_{s \in S} [e(s, \theta) - \gamma(\theta)]^2 \Delta(s).$$

Typically a frequentist sampler uses the prior information summarized in the $m_i$'s
to choose some design $\Delta$ and then looks for estimators which are unbiased for
estimating $\gamma(\theta)$. For such an unbiased estimator the risk function is just its
variance.
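For a small population the risk function can be evaluated by enumerating the samples. The following sketch is only illustrative: units are labeled from 0, a design is a dictionary from samples (tuples of labels) to probabilities, and the expansion estimator $N \times$ (sample mean) is our choice of example.

```python
def risk(theta, design, estimator):
    """Risk of `estimator` for the population total under `design`.

    The estimator sees only the sample labels and the sampled y-values.
    """
    total = sum(theta)
    value = 0.0
    for s, prob in design.items():
        observed = [theta[i] for i in s]
        value += prob * (estimator(s, observed) - total) ** 2
    return value


def expansion_estimator(n_units):
    """Return the estimator N * (sample mean) of the population total."""
    def estimate(s, observed):
        return n_units * sum(observed) / len(observed)
    return estimate


# Simple random sampling of 2 units out of N = 3.
srs = {(0, 1): 1.0 / 3.0, (0, 2): 1.0 / 3.0, (1, 2): 1.0 / 3.0}
theta = (1.0, 2.0, 6.0)
example_risk = risk(theta, srs, expansion_estimator(3))
```

Under this design the expansion estimator is unbiased for the total, so the computed risk is its design variance at the given $\theta$.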


For such a problem a typical sample point is the set of labels of the units
contained in the observed sample along with their values of the characteristic of
interest. We will denote such a point by

$$x = (s, x_s) = (s, (x_{i_1},\ldots,x_{i_{n(s)}}))$$

when $s = \{i_1,\ldots,i_{n(s)}\}$ is the observed sample.
Hence for a given design $\Delta$ the sample space is given by

$$X = \{(s, x_s) \mid \Delta(s) > 0 \text{ and } x_s = \theta(s) \text{ for some } \theta \in \Omega\}.$$

So for a fixed $\theta \in \Omega$ the probability function over $X$ is given by

$$P_\theta(x) = P_\theta(s, x_s) = \Delta(s) \text{ if } x_s = \theta(s),$$
$$\phantom{P_\theta(x) = P_\theta(s, x_s)} = 0 \text{ otherwise}.$$

This defines a discrete model. Note that

$$\Omega_x = \Omega_{(s, x_s)} = \{\theta \mid P_\theta(x) > 0\} = \{\theta \mid \theta(s) = x_s\},$$

from which it follows that

$$P_\theta(x) = P_\theta(s, x_s) = \Delta(s) \text{ if } \theta \in \Omega_x,$$
$$\phantom{P_\theta(x) = P_\theta(s, x_s)} = 0 \text{ elsewhere}.$$

If, as before, $\bar{L}_x(\cdot)$ denotes the standardized likelihood function, we see that

$$\bar{L}_x(\theta) = \bar{L}_{(s, x_s)}(\theta) = 1 \text{ if } \theta \in \Omega_x,$$
$$\phantom{\bar{L}_x(\theta) = \bar{L}_{(s, x_s)}(\theta)} = 0 \text{ elsewhere}.$$

Since the mapping $x \to \bar{L}_x(\cdot)$ is a minimal sufficient statistic and the likelihood
function is constant over $\Omega_x$, all we learn from the observed data $x = (s, x_s)$ are
the values of the characteristic for the units in the sample and that the true $\theta$ must
be consistent with these observed values.
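The flat likelihood can be seen directly in a toy population. In the sketch below (our own construction) there are $N = 3$ units with $y_i \in \{0, 1\}$, units are labeled from 0, and two different designs are compared on the same observed data.

```python
from itertools import product


def standardized_likelihood(sample, observed, omega, design):
    """Standardized likelihood of each theta in omega given x = (sample, observed).

    P_theta(x) equals design[sample] when theta agrees with the observed
    values on the sampled units and 0 otherwise. design[sample] must be > 0.
    """
    lik = {}
    for theta in omega:
        consistent = all(theta[i] == y for i, y in zip(sample, observed))
        lik[theta] = design[sample] if consistent else 0.0
    top = max(lik.values())
    return {theta: value / top for theta, value in lik.items()}


omega = list(product((0, 1), repeat=3))
srs = {(0, 1): 1.0 / 3.0, (0, 2): 1.0 / 3.0, (1, 2): 1.0 / 3.0}
unequal = {(0, 1): 0.7, (0, 2): 0.2, (1, 2): 0.1}

# Units 0 and 2 are observed with values 1 and 0.
under_srs = standardized_likelihood((0, 2), (1, 0), omega, srs)
under_unequal = standardized_likelihood((0, 2), (1, 0), omega, unequal)
```

Both designs give the same indicator function: it equals 1 exactly for the two populations (1, 0, 0) and (1, 1, 0) that agree with the observed values, and 0 for the rest.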

Note that this observation is independent of the sampling design. That
is, after the sample $x = (s, x_s)$ is observed the minimal sufficient statistic does
not depend in any way on the value of $A(s)$. (In fact, Basu demonstrated that
this is true even for sequential sampling plans where the choice of a population
unit at any stage is allowed to depend on the observed $y$-values of the previously
selected units.) Furthermore, the principle of maximizing the likelihood function
cannot be invoked to find an estimate of the population total since the
standardized likelihood function is constant over $\Omega_x$.
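
The design independence of the likelihood can be checked on a toy population. The sketch below is only an illustration under assumed values: a population of three units (labelled 0, 1, 2 in the code) with binary characteristics, two hypothetical designs `srs` and `unequal` on samples of size two, and helper names (`prob`, `standardized_likelihood`) that are not from the text.

```python
from itertools import product

def prob(theta, sample, values, design):
    """P_theta(x): the design probability of s if theta agrees with the observed values, else 0."""
    agrees = all(theta[i] == v for i, v in zip(sample, values))
    return design[sample] if agrees else 0.0

def standardized_likelihood(sample, values, design, omega):
    """Likelihood over omega, rescaled so that its maximum equals 1."""
    raw = {theta: prob(theta, sample, values, design) for theta in omega}
    top = max(raw.values())
    return {theta: p / top for theta, p in raw.items()}

# Parameter space: every vector (y_0, y_1, y_2) with entries 0 or 1 (an assumption).
omega = list(product([0, 1], repeat=3))

# Two hypothetical designs on the samples of size two.
srs = {(0, 1): 1 / 3, (0, 2): 1 / 3, (1, 2): 1 / 3}
unequal = {(0, 1): 0.7, (0, 2): 0.2, (1, 2): 0.1}

# Observed sample point: units 0 and 2 with values 1 and 0.
sample, values = (0, 2), (1, 0)
lik_srs = standardized_likelihood(sample, values, srs, omega)
lik_unequal = standardized_likelihood(sample, values, unequal, omega)

# The parameter points consistent with the data, i.e. the set Omega_x.
consistent = [theta for theta in omega if lik_srs[theta] == 1.0]
```

Both designs give the same standardized likelihood: it equals one on the parameter points that agree with the observed values and zero elsewhere, so nothing about $A(s)$ survives in it.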

In the next section some implications of these results will be discussed. 


Some Implications 


For most statisticians, perhaps the most unsettling aspect of Basu’s 
argument is his demonstration that the likelihood principle implies that the 
design probability should not be considered in analyzing the data, after the 
sample has been observed. In particular, choosing an estimator which is unbiased 
for a given design violates the likelihood principle. But from a naive point of 
view this is not surprising when one recalls the strange way probability is used in 
survey sampling. Since the characteristic $y_i$ is assumed to be measured without
error, the only way probability enters the model is through the design $A$. That is,
the phenomenon of randomness is not inherent within the problem but is
artificially injected into it by the statistician. In other areas of statistics the
statistician uses probability theory to model uncontrollable randomness while in 
survey sampling the whole analysis is based on a controlled randomness 
introduced by the statistician. 

Godambe (1966) had noted before Basu (1969) that the application of 
the likelihood principle to survey sampling would mean that the sampling design 
is irrelevant for data analysis. But he, as many other non-Bayesian statisticians 
since then, has chosen to ignore the likelihood principle and tried to justify a role 
for the design when analyzing the data. 

Scott (1977) and Sugden and Smith (1984) considered situations where 
some information available to the person who designed the sample is not 
available to the one who must analyze the data. They argued that in such 
situations the design may become informative. Although such examples are
interesting, I do not feel that they lessen the force of Basu’s argument.

Recall that the likelihood principle in survey sampling justifies a very
intuitively appealing notion: given the observed data $x = (s, x_s)$, one just
learns the $y_i$’s for $i \in s$ and that the unsampled $y_j$’s for $j \notin s$ must come from a $\theta$
which is consistent with $x$. So the basic question of survey sampling is how one
can relate the unseen, $\theta(s') = \{y_j : j \notin s\}$, to the seen, $\theta(s) = \{y_i : i \in s\}$. Without
some assumptions about how these two sets are related, knowing $\theta(s)$ does not
tell one anything at all about $\theta(s')$. Presumably, for a frequentist, the design $A$
along with the unbiasedness requirement is a way to relate the unseen to the
seen. But I have never understood the underlying logic of the relationship.

On the other hand, the Bayesian paradigm allows one to relate the
unseen to the seen in a straightforward way which does not violate the likelihood
principle. Let $q(\theta)$ denote the Bayesian’s prior density over $\Omega$; $q$ would be
chosen to represent and summarize the statistician’s prior beliefs about $\theta$. Given
the sample $x = (s, x_s)$ one then computes the conditional density of $\theta$ given $x$, say
$q(\theta \mid x)$. This is concentrated on the set $\Omega_x$ and is just $q$ with the seen, $\theta(s) =
\{y_i : i \in s\}$, inserted in their appropriate places and normalized so that it integrates to
one over $\Omega_x$. Then the Bayes estimator against $q$ for the population total is

$$\sum_{i \in s} y_i + \sum_{j \notin s} E(y_j \mid x),$$

where for $j \notin s$, $E(y_j \mid x)$ is the conditional expectation of $y_j$ with respect to $q(\theta \mid x)$.

The form of the Bayes estimator emphasizes that estimation in survey 
sampling can be thought of as a prediction problem, i.e. of predicting the unseen 
from the seen. That is, in these problems one should argue conditionally from 
the seen to the unseen. 

As was to be expected the Bayes estimator does not depend on the 
design. In most of the standard statistical decision problems an estimator is 
admissible if and only if it is a Bayes estimator or limit of Bayes estimators. 
This suggests that in survey sampling the admissibility of an estimator should 
not depend on a particular design. This was demonstrated in Scott (1975). Let
$A_1$ and $A_2$ be two designs with $A_1$ dominating $A_2$, i.e. if $s$ is such that $A_1(s) = 0$
then $A_2(s) = 0$ as well. Then Scott proved that if the estimator $e$ is admissible for
design $A_1$ then it is also admissible for design $A_2$.

From the Bayesian point of view the statistician should use a design 
which minimizes the overall Bayes risk. In practice such designs are very difficult 
to find but often such minimizers are purposeful designs, i.e., designs which put 
probability one on a single set. Hence Basu has elegantly outlined a coherent 
theory of survey sampling in which random sampling or more generally the 
sampling design has little or no role to play. Ericson (1969) is one example of a 
Bayesian approach to survey sampling very much in the spirit of Basu. However, 
one serious difficulty in using a Bayesian approach to survey sampling is 
specifying a realistic prior distribution. Even for those who are somewhat 
sympathetic to Bayesian ideas, choosing a prior in survey sampling is almost 
impossible because of the large number of parameters. Hence, it would be of
interest to have an approach to survey sampling which did not violate the
likelihood principle, allowed one to think conditionally given the sample, and
allowed one to incorporate various levels of prior information relating the unseen
to the seen without actually specifying a prior distribution. Such an approach is
suggested in Basu’s famous Jumbo example in Basu (1971).

Here Basu was discussing the Horvitz-Thompson estimator and other 
estimators which were suggested for some unequal probability designs. The 
Jumbo example dealt with estimating the total weight of a group of elephants 
where Jumbo was the largest. 

Following Basu, let $N$ be the size of the herd and $y_i$ the weight of the $i$th
elephant. Let $m_i$ be our best prior guess, before the sample is observed, of the
weight of elephant $i$; that is, the $m_i$’s incorporate all our prior information about
the herd. We begin by assuming that the herd is reasonably homogeneous (in
contrast to Basu, there is no Jumbo). Suppose a sample $s$ with $n(s) = n > 1$ is
chosen and the corresponding $y_i$’s observed. Suppose we believe that these $n$
observed ratios $\{y_i/m_i : i \in s\}$ are representative of the $N - n$ unobserved ratios
$\{y_j/m_j : j \notin s\}$. Although we may not be able to define representative, we have an
intuitive idea of what it means. Furthermore, if in practice we obtained a sample
which we believed was not representative, then we would be foolish to act as if it
were.

Assuming the sample is representative, Basu suggested that $\bar{r} =
\frac{1}{n} \sum_{i \in s} (y_i/m_i)$ should be a good guess for $y_j/m_j$ when $j \in s'$. Hence, for a typical
unsampled unit $j$, a reasonable estimate of $y_j$ is $m_j \bar{r}$. This suggests that a sensible
estimate of the population total is

$$\sum_{i \in s} y_i + \left[ \frac{1}{n} \sum_{i \in s} (y_i/m_i) \right] \sum_{j \notin s} m_j. \qquad (1)$$



This estimator can be given a pseudo Bayesian justification by creating a 
posterior distribution for the unseen given the seen which is appropriate when one 
believes the sample is representative. Suppose in the sample of n observations 
there are $r$ distinct values of these ratios, say $a_1, \ldots, a_r$. Let $k_j$ be the number of
observed $y_i/m_i$’s which equal $a_j$, for $j = 1, \ldots, r$. Construct an urn which contains $n$
balls, where $k_j$ are labeled $a_j$ for $j = 1, \ldots, r$. Then take as the pseudo posterior
distribution for the N — n unobserved ratios the distribution generated by simple 
Polya sampling from the urn. To begin, a ball is chosen at random from the urn 
and the observed value is given to the unobserved ratio with the smallest label. 
This ball and an additional ball with the same value are returned to the urn. 
Another ball is chosen from the urn and its value is given to the unobserved ratio 
with the next smallest label. This ball and another with the same value are 
returned to the urn. The process is continued until all N — n unobserved ratios 
are given a value. We will call this pseudo posterior the Polya posterior for the 
unseen given the seen. The Polya posterior is a pseudo posterior because it does 
not arise from any single prior distribution over the parameter space. This is 
intuitively clear since it is data dependent. On the other hand, it does reflect the 
belief that the unseen are like the seen. Finally it is easy to check that the Bayes 
estimate of the population total using the Polya posterior is just the estimate
given in (1).
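
The urn scheme is easy to simulate. The following is a minimal sketch, not code from any of the cited papers: it computes the estimate in (1) directly and compares it with a Monte Carlo average of population totals drawn from the Polya posterior. The function names, the elephant weights, the prior guesses and the number of draws are all assumptions made for illustration.

```python
import random

def ratio_estimate(y_seen, m_seen, m_unseen):
    """Estimate (1): the seen total plus the mean observed ratio times the unseen guesses."""
    r_bar = sum(y / m for y, m in zip(y_seen, m_seen)) / len(y_seen)
    return sum(y_seen) + r_bar * sum(m_unseen)

def polya_draw(ratios, n_unseen, rng):
    """One draw of the unseen ratios by simple Polya sampling from the urn."""
    urn = list(ratios)
    drawn = []
    for _ in range(n_unseen):
        value = rng.choice(urn)   # choose a ball at random
        drawn.append(value)       # assign it to the next unobserved ratio
        urn.append(value)         # return the ball plus one more with the same value
    return drawn

def polya_mean_total(y_seen, m_seen, m_unseen, draws, seed):
    """Monte Carlo mean of the population total under the Polya posterior."""
    rng = random.Random(seed)
    ratios = [y / m for y, m in zip(y_seen, m_seen)]
    total = 0.0
    for _ in range(draws):
        unseen = polya_draw(ratios, len(m_unseen), rng)
        total += sum(y_seen) + sum(r * m for r, m in zip(unseen, m_unseen))
    return total / draws

# Hypothetical herd: three weighed elephants, four not weighed.
y_seen, m_seen = [10.0, 12.0, 9.0], [9.0, 11.0, 10.0]
m_unseen = [8.0, 12.0, 10.0, 9.0]
exact = ratio_estimate(y_seen, m_seen, m_unseen)
simulated = polya_mean_total(y_seen, m_seen, m_unseen, draws=20000, seed=0)
```

Each Polya draw has the empirical distribution of the observed ratios as its marginal, so `simulated` should settle near `exact` as the number of draws grows.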

Note that in the special case when little is known about the herd, i.e. all the
$m_i$’s are equal, the estimator in (1) reduces to $(N/n) \sum_{i \in s} y_i$, which is the
classical estimator of the population total.


In Meeden and Ghosh (1983) the estimator given in (1) was shown to be 
admissible. The proof used the stepwise Bayes technique. In the proof the Polya 
posterior played a crucial role. Hence, Basu’s argument not only gives an
intuitive justification for the estimator (1) but suggests a method for proving its
admissibility. This approach can be extended to prove the admissibility of a 
variety of other estimators. (See Vardeman and Meeden (1984) for details.) 

For example, suppose the population can be stratified into various strata 
each of which is relatively homogeneous. If the sample contains units from each 
stratum then the estimator in (1) can be used within each stratum, where within 
each stratum the $m_i$’s are assumed to be equal, to produce an estimate of the
population total. If in a given stratum, say $k$, we decide to sample $n_k$ units, then
the stepwise Bayes argument shows that any set of $n_k$ units within the stratum is
optimal. That is, we may choose our $n_k$ units by simple random sampling
without replacement. This type of argument gives a noninformative Bayesian
justification for a variety of the usual estimators in survey sampling along with a 
justification for choosing the sample at random. 

One can argue that it is a relatively weak justification since it justifies 
any method of selecting the sample. In spite of Basu’s arguments, even some
thoroughgoing Bayesians still admit to being attracted to the notion of
randomization even though they do not know any intellectual justification for it. I,
however, find Basu’s statement on page 594 of Basu (1980), in a slightly different
context, quite compelling.

“I have no objection to prerandomization as such. Indeed, I think that 
the scientist ought to prerandomize and have the physical act of randomization
properly witnessed and notarized. In this crooked world, how else can he avoid 
the charge of doctoring his own data?” 
SURVEY SAMPLING — AS I UNDERSTAND IT
(A Development of Optimality Criterion) 


V. P. Godambe, University of Waterloo 


This was the Gold Medalist Presentation at the Statistical Society of Canada 
meetings held in Victoria, 6th June 1988. 


For since the fabric of the universe is most perfect and the work of 
a most wise Creator, nothing at all takes place in the universe in 
which some rule of maximum or minimum does not appear. 

— Leonhard Euler 


Introduction 


This is a brief overview of the historical development of the optimality 
criterion in survey sampling theory and practice. The presentation here has been 
considerably simplified, for it takes for granted a fundamental result: in the survey
sampling set-up the entire data can be effectively summarized by the set of
observed units (or individual labels) together with the corresponding variate
values, as in (1) to follow. This is a basic discovery due to Basu, who proved
(Basu, 1958) that in the survey-sampling set-up (1) constitutes a minimal sufficient
statistic.


Definitions, Notation and the Problem 


Survey Population $P$ is a finite collection of individuals (houses, blocks,
farms, households, etc.), each bearing a distinctive label $i$; we may write

$$P = \{i : i = 1, \ldots, N\},$$

where $N$ is the size of $P$. The variate under study, such as income, size, produce, etc.,
is denoted by $y$. The value of $y$ associated with the individual $i$ is $y_i$, $i = 1, \ldots, N$.
We want to estimate some unknown characteristic, say the mean

$$Y = \sum_{i=1}^{N} y_i / N$$

of the population $P$. For this purpose a sample $s$ of size $n$ is drawn from $P$
($s \subset P$), using a sampling design (simple random sampling or stratified sampling,
etc.), and the values $y_i$, $i \in s$, are ascertained through a survey.

Problem I: To estimate $Y$ given the data

$$d = \{(i, y_i) : i \in s\} \qquad (1)$$

and the sampling design. (A related problem is how to use the pre-survey
knowledge about $P$, particularly in the choice of a sampling design.)

For historical reasons the above problem remained confused, until 
recently, with the following quite different problem. 

A treatment is tried $n$ times with the following results:

$$y_1, y_2, \ldots, y_n \qquad (2)$$

Problem II: To estimate the average treatment effect $\theta$ on the basis of
the data (2).


Fundamental Distinction 


The fundamental distinction between the two problems above becomes clear at
once from the fact that while in problem (II) the sample mean $\sum_{i=1}^{n} y_i / n$ is the
unbiased minimum variance (UMV) estimate for “$\theta$”, the corresponding mean
$\sum_{i \in s} y_i / n$ in problem (I) is not UMV for $Y$, even for a simple random
sampling design.

The above phenomenon, as is now well understood, is due to the
existence of individual labels “$i$” in the data (1), unlike in data (2). “$Y$” in
problem (I) is the mean of the actual (survey) population. In contrast, “$\theta$” in
problem (II) is the mean of a hypothetical population generated by repeated
(independent) trials of the treatment.

Why was problem (I) confused with problem (II) for a long time? 

Answer: When the survey sampler arrived on the statistical stage (at 
about the beginning of this century), there already was a statistical theory 
developed by Galton, Pearson and others (to study primarily biological 
phenomena) which essentially dealt with problems akin to (II) of hypothetical 
populations. The confusion arose out of the attempts of the early survey 
samplers to use the then existing statistical theory to solve problem (I) 
concerning the actual (survey) population. 


Historical Comments 


Today’s popular understanding of statistics consists of probabilistic
estimates, say, for instance, of a country’s average income, based on some random
samples. But historically this meshing of probability calculus with actual social
statistics proved to be far more formidable than establishing the central
limit theorem, Bayes theorem and the like. Actually both social statistics
(Graunt) and probability theory (Pascal & Fermat) originated around 1660, but
the meshing of the two occurred only in this century. Even in earlier history (for
instance in Jewish and Jain literature) one can find discussions of uncertain
(probabilistic) inference; almost none relate to survey sampling. One exception I
am tempted to quote. This is from the Mahabharat, the old Indian epic
(Vana-Parva; Nala-Damayanti Akhyan).
The God Kali has his eye on a beautiful princess and is dismayed
when Nala wins her hand. In revenge an evil spirit enters the 
body of the virtuous prince. Crazed with frenzy for gambling, 
Nala loses his kingdom, and wanders demented for many years. 
Nala’s change of fortune is described in a remarkable anecdote. 


In an alien form, he has been travelling with another king, 
Bhangasuri. This latter, wanting to flaunt his skill in numbers, 
estimates the number of leaves, and the number of fruit, on two 
great branches of a spreading tree. There are, he avers, 2,095 
fruits. Nala counts all night and is duly amazed by morning. 
Bhangasuri accepts his due: 


I of dice possess the science, and in numbers thus am skilled. 


He agrees to teach this science to Nala in exchange for some 
classes in horsemanship, in which, despite his exile, Nala still 
excels. At the end of this sensational course in survey-sampling 
Nala vomits out the poison of Kali, and is restored his normal 
form. Kali, exorcised by mathematics, retires to the tree. Nala 
returns to his kingdom, offers his still faithful bride as his final 
stake and quickly recoups all his losses, and lives happily ever 
after. 


(Reproduced from History and Philosophy of Science Seminar by 
Ian Hacking) 


Neyman’s UMV-Criterion 


The first well publicized attempt to solve the survey sampling problem,
Problem I, using the then available statistical theory developed by Galton,
Pearson, Fisher and others was due to Neyman (1934). Actually this theory, as
said before, was meant for the hypothetical populations of Problem II. Following
this theory, for simple random sampling (with or without replacement) Neyman
considered the class of unbiased estimates (for the population mean $Y$) of the
form

$$\sum_{r=1}^{n} a_r y_r,$$

where $a_r$ is the coefficient associated with the $r$th draw and $y_r$ is the observed
value of $y$ at the $r$th draw. The variance of this estimate is minimized, Neyman
argued, using the Gauss-Markov theorem, for $a_r = 1/n$, $r = 1, \ldots, n$. In this sense,
Neyman demonstrated the UMV-ness of the sample mean. Similarly, for
stratified sampling he established the UMV-ness of the corresponding weighted mean.
(Similar earlier, but little known, results are due to Tchuprow (1923); see
Bellhouse, 1987.) In retrospect it appears Neyman obtained UMV estimates by
restricting himself to the class of estimates which depended on individual labels $i$
only to the extent that they determined the stratum to which the individual belonged.
That is, labels were ignored within each stratum.

For several years, following Neyman, survey samplers investigated UMV 
estimation for more sophisticated designs than stratification. For reducing 
variance of estimates Hansen and Hurwitz (1943) introduced unequal probability 
sampling. Here, however, individual labels ($i$) were used not just for stratification
but also were used even within strata. That is, in a stratum, two individuals 
could be selected with different probabilities. 

What happened to Neyman’s UMV-estimation here? Using individual
labels $i$, Horvitz and Thompson (1952) constructed three different classes of
estimates and investigated UMV estimation in each class. Though these latter
investigations were inconclusive, the work clearly established that wider classes of
estimates than those considered by Neyman could be constructed using
individual labels.

Neyman’s introduction of UMV estimation in survey sampling led to an 
improved practice of stratified sampling, a better understanding of randomization 
and finally suggested the innovation of unequal probability sampling and general 
sampling designs. 

Here however, the UMV-criterion appeared to have reached its limits of 
usefulness. 

During 1935-1955 and even afterwards, while comparing variances of
different estimates, possibly under different designs, proved to be rewarding, the
search for UMV estimation led to the futile confusion mentioned earlier, for such
estimation was generally nonexistent!

Godambe (1955) introduced a general class of label dependent estimates
of which all the known estimates were special cases. For this class, he
demonstrated that UMV estimation was nonexistent for any sampling design
(trivial exceptions apart). In particular, the sample mean was not UMV for the
simple random sampling design.
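
A small enumeration makes the failure of the sample mean to be UMV concrete. This is not Godambe’s own argument but a standard illustration under assumed data: a label dependent "difference" estimator, built from hypothetical guesses $c_i$ attached to the labels, is design-unbiased for every population, has zero variance when the population happens to equal the guesses, and is worse than the sample mean for other populations. The function names and the numbers are assumptions.

```python
from itertools import combinations
from fractions import Fraction

def design_moments(estimator, y, n):
    """Exact mean and variance of an estimator over all simple random samples of size n."""
    samples = list(combinations(range(len(y)), n))
    p = Fraction(1, len(samples))
    values = [estimator(s, y) for s in samples]
    mean = sum(p * v for v in values)
    var = sum(p * (v - mean) ** 2 for v in values)
    return mean, var

def sample_mean(s, y):
    """Label-free estimator: the plain mean of the sampled values."""
    return Fraction(sum(y[i] for i in s), len(s))

def make_difference_estimator(c):
    """Label dependent estimator: sample mean of (y_i - c_i) plus the mean of all c_i."""
    c_bar = Fraction(sum(c), len(c))
    def estimator(s, y):
        return Fraction(sum(y[i] - c[i] for i in s), len(s)) + c_bar
    return estimator

c = [3, 7, 8, 14]                 # hypothetical guesses attached to the labels
diff = make_difference_estimator(c)

y = [3, 7, 8, 14]                 # a population that equals the guesses
true_mean = Fraction(sum(y), len(y))
mean_sm, var_sm = design_moments(sample_mean, y, 2)
mean_de, var_de = design_moments(diff, y, 2)

y_flat = [5, 5, 5, 5]             # a population where the sample mean is exact
mean_sm_flat, var_sm_flat = design_moments(sample_mean, y_flat, 2)
mean_de_flat, var_de_flat = design_moments(diff, y_flat, 2)
```

Neither estimator has uniformly smaller variance, which is the sense in which no UMV estimate exists once labels may be used.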

Looking back, it would appear that survey samplers made considerable 
progress in sampling practice and theory, in their search for the nonexistent UMV 
estimation! But such things can happen in Science. Or one may say, survey 
samplers, in their investigation of UMV estimation, informally restricted 
themselves to the use of labels only to the extent they intuitively looked useful. 
This was the case with Neyman (1934). For a general development of this 
approach we refer to Hartley and Rao (1968). 


A New Criterion: UMℰV


Godambe (1955) also showed that in the class of all (linear) label
dependent unbiased estimates for the population mean $Y$, the HT-estimate

$$e_{HT} = \frac{1}{N} \sum_{i \in s} \frac{y_i}{\pi_i} \qquad (3)$$

(due to Horvitz and Thompson, 1952), where $\pi_i$ is the probability of including the
individual $i$ in the sample $s$ drawn by the specified sampling design, has
minimum expected variance. Here expectation is w.r.t. any distribution belonging to
a class of distributions on the variate values $(y_1, \ldots, y_i, \ldots, y_N)$ under study. This
class of distributions, called a Superpopulation Model (SPM), is supposed to be a
formalization of our pre-survey knowledge of the survey-population $P$ (see next
section for illustration). Thus w.r.t. the SPM the HT-estimate is UMℰV: U =
unbiased, V = variance, w.r.t. the sampling design, and ℰ = expected w.r.t. the
SPM. Note that many estimates in common use, such as the sample mean for
simple random sampling and the appropriately weighted mean for stratified
sampling, are but special cases of the estimate $e_{HT}$ in (3). Hence they are UMℰV
w.r.t. suitable SPMs.
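
The design-unbiasedness of the HT-estimate (3) can be verified by direct enumeration. The sketch below assumes a hypothetical population of three units (labelled 0, 1, 2 in the code) and a made-up unequal probability design on samples of size two; exact fractions are used so that the check involves no rounding. It illustrates unbiasedness only, not the minimum expected variance property.

```python
from fractions import Fraction

def inclusion_probabilities(design, N):
    """pi_i: total design probability of the samples that contain unit i."""
    return [sum(p for s, p in design.items() if i in s) for i in range(N)]

def ht_mean(s, y, pi):
    """HT-estimate (3) of the population mean from the sample s."""
    return sum(Fraction(y[i]) / pi[i] for i in s) / len(y)

# Hypothetical population and unequal probability design (assumptions).
y = [2, 5, 11]
design = {
    (0, 1): Fraction(1, 2),
    (0, 2): Fraction(1, 3),
    (1, 2): Fraction(1, 6),
}

pi = inclusion_probabilities(design, len(y))
# Expectation of the HT-estimate under the design.
expected = sum(p * ht_mean(s, y, pi) for s, p in design.items())
population_mean = Fraction(sum(y), len(y))
```

Here `expected` coincides with `population_mean`, and the inclusion probabilities add up to the fixed sample size of two.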

Actually, since much earlier than 1955, variances had been compared in
terms of their expectations w.r.t. the SPM (Cochran, 1939). Thus in the absence
of UMV-estimation its replacement by UMℰV-estimation seemed natural. By
now the UMℰV-criterion seems to have received general acceptance in theory as
well as in practice. It is also used, somewhat reluctantly though, by Model
Theorists in sampling.

The discovery that in survey-sampling the likelihood function is 
independent of the sampling design and hence according to the “Likelihood 
Principle” (LP) the inference must be independent of the design (randomization) 
probabilities (Godambe, 1966), gave impetus to the development of the model 
theory (Royall, 1970). This theory, to implement the above conclusion of LP, 
restricts inference/estimation exclusively to the probabilities given by SPM. 
(Such restriction was previously proposed by Brewer (1963), but he did not tie it 
to the LP. For this reason, possibly, Brewer’s work was not effective in the 
development of model theory. By this time due to the works of Barnard, 
Birnbaum and Savage, LP became respectable.) With this restriction, the model 
theory estimation, using the notation above, proceeds as follows. 

For a given (fixed) sample $s$, in terms of $y_i$, $i \in s$, construct the class of all
linear estimates which are SPM-unbiased for the survey-population characteristic,
say $Y$. From this class, the minimum variance estimate (SPM-UMV) is
recommended, for practical use, by the model theory.

Now for any sample s, the SPM-UMV estimate exists for rather 
restrictive SPMs. On the other hand when design and model probabilities are 
combined one can obtain UMℰV estimates (or close approximations) for far more
flexible SPMs incorporating nuisance parameters of high dimension (Godambe, 
1982, 1983). 

Anyway, even model theorists, in an attempt to make their estimation
robust (to departures from the assumed SPM), have relied on the UMℰV-criterion
(Brewer, 1979). Actually, from the model theorists’ ideological criticism and
rejection, randomization emerged with new meaning, vigor and applications.

As I mentioned before, the UMV-criterion led to better understanding
and practice of stratified sampling; the same thing can be said to have been
achieved by the UMℰV-criterion for unequal probability sampling beyond
stratification.

Yet, the UMℰV-criterion is rather restrictive. It is generally non-vacuous
only for fixed sample size designs. As mentioned before, the HT-estimator is
UMℰV, but generally only for fixed sample size designs. It is absurd for the
following (rather extreme) random sample size design: with probability 1/2, a
random sample of size “1” is drawn, and with the remaining probability 1/2, the
whole population is sampled. Now when the whole population is sampled, the
HT-estimate (3) of the population mean $Y$ is approximately $2Y$! Yet random
sample size designs do occur in practice. For instance, in surveys having
non-respondents the (effective) sample size is essentially a random variate. The same
thing happens for domain estimation.

Just as the extension of the UMV criterion to the UMℰV criterion was
necessary to cover label dependent estimates, a further extension of the UMℰV
criterion itself is necessary to cover random sample size designs. This is achieved by the
UMℰV-f criterion introduced in the next section. With this introduction, we can
use even more flexible/broader SPMs than was possible under the UMℰV
criterion. This will be clear soon.


UMℰV-f Criterion


Here we present the work of Godambe and Thompson (1986a). In
addition to the notation above we denote by $x_i$ the covariate value associated
with the individual $i$, $i = 1, \ldots, N$. We assume $x = (x_1, \ldots, x_i, \ldots, x_N)$ known and the
SPM to be a class of distributions on $(y_1, \ldots, y_N)$ satisfying the following
conditions:

(I) Given the covariate $x$, $y = (y_1, \ldots, y_i, \ldots, y_N)$ are distributed mutually
independently.

(II) With respect to any distribution in the class, the expectation
$\mathcal{E}(y_i - \theta x_i) = 0$, $i = 1, \ldots, N$.

That is, under the SPM, $\theta$ is the regression parameter, intercept terms being
ignored for simplicity. We define


$$g = \sum_{i=1}^{N} (y_i - \theta x_i); \qquad (4)$$

$g$ is said to be a population or $y$-based unbiased estimating function, since $\mathcal{E}(g) =
0$. If $[g = 0] \Rightarrow [\theta = \theta_N]$, $\theta_N$ is a $y$-based estimate of the SPM-parameter $\theta$.
Further, $\theta_N = \sum_{i=1}^{N} y_i / \sum_{i=1}^{N} x_i$ is itself a Survey Population parameter. The theory
of Godambe and Thompson (1986a) provides optimal (sample based) estimation for $\theta_N$
as follows. Let $h(d, \theta)$ be any function of the parameter $\theta$ and the data
$d = \{(i, y_i) : i \in s\}$ in (1), with

$$E(h - g) = 0, \qquad (5)$$


“$E$” being the expectation under the sampling design, holding $y$ and $\theta$ in (I) and
(4) above fixed. A function $h$ satisfying (5) is called a (design) unbiased
estimating function; a solution of the equation $h(d, \theta) = 0$ provides a (data $d$
based) estimate of both the parameters $\theta_N$ and $\theta$. Now the function $h^*(d, \theta)$
satisfying (5) is said to be UMℰV-f Optimum (f for estimating function) if, for
any $h$ satisfying (5),

$$\mathcal{E}E(h^* - g)^2 \leq \mathcal{E}E(h - g)^2, \qquad (6)$$


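where $\mathcal{E}$, as before, is the expectation w.r.t. the SPM-I&II above.

Theorem. For SPM-I&II and any sampling design with $\pi_i > 0$, $i =
1, \ldots, N$, the UMℰV-f optimum $h^*$ is given by

$$h^* = \sum_{i \in s} (y_i - \theta x_i)/\pi_i. \qquad (7)$$

Solving the equation $h^* = 0$, we get for $\theta$ and $\theta_N$ the optimum estimate

$$\frac{\sum_{i \in s} (y_i/\pi_i)}{\sum_{i \in s} (x_i/\pi_i)}. \qquad (8)$$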


As a special case, for all $x_i = 1$ in (8),

$$e = \frac{\sum_{i \in s} (y_i/\pi_i)}{\sum_{i \in s} (1/\pi_i)}. \qquad (9)$$

The relationship between the estimate $e$ in (9) and the HT-estimate $e_{HT}$ in (3) is
given by the fact that for any sampling design

$$E\left\{\sum_{i \in s} (1/\pi_i)\right\} = N.$$

Note, now, that for the random sample size design considered in the previous
section, $\pi_i = (N+1)/2N$, $i = 1, \ldots, N$, and when the whole population is sampled $e$
in (9), unlike $e_{HT}$ in (3), equals $Y$!
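
The contrast between (3) and (9) under the random sample size design can be worked through exactly. The sketch below assumes a hypothetical population of four values and reads "a random sample of size 1" as one unit drawn with equal probability $1/N$; the function names are illustrative only.

```python
from fractions import Fraction

def inclusion_probability(N):
    """pi_i for the design: one unit at random w.p. 1/2, the whole population w.p. 1/2."""
    return Fraction(1, 2) * Fraction(1, N) + Fraction(1, 2)

def ht_estimate(sample, y, pi):
    """e_HT in (3): (1/N) times the sum over the sample of y_i / pi_i."""
    return sum(Fraction(y[i]) / pi for i in sample) / len(y)

def ratio_estimate(sample, y, pi):
    """e in (9): sum of y_i / pi_i divided by the sum of 1 / pi_i over the sample."""
    return sum(Fraction(y[i]) / pi for i in sample) / sum(1 / pi for _ in sample)

y = [4, 6, 10, 20]                 # hypothetical population values
N = len(y)
pi = inclusion_probability(N)      # equals (N + 1) / (2N)
census = range(N)                  # the outcome in which everyone is sampled
true_mean = Fraction(sum(y), N)

ht_on_census = ht_estimate(census, y, pi)
ratio_on_census = ratio_estimate(census, y, pi)

# Design expectation of the sum of 1/pi_i: size-1 samples and the census, each w.p. 1/2.
expected_weight_sum = Fraction(1, 2) * (1 / pi) + Fraction(1, 2) * (N / pi)
```

With these numbers the HT-estimate on the full population is $2N/(N+1)$ times the true mean, while the estimate (9) returns the true mean itself.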

A generalization of the theorem just stated is obtained by replacing
$y_i - \theta x_i$ in (II) by any function

$$\phi_i(y_i, \theta),$$

covering many practical situations including (optimal) estimation of quantiles.
The appeal of this approach to practitioners is evident from the fact that
special cases of the function $\phi_i$ above were already in common use (Binder, 1983)
before the present theory (Godambe & Thompson, 1986a) was developed. For
further applications we refer to a later paper of Godambe and Thompson
(1986b).
TWO BASIC PARTIAL ORDERINGS FOR DISTRIBUTIONS DERIVED
FROM SCHUR FUNCTIONS AND MAJORIZATION 


Kumar Joag-Dev, University of Illinois and 
Florida State University 


and 


Jayaram Sethuraman, Florida State University 


Abstract 


Researchers in applied fields have long recognized the usefulness of 
inequalities when exact results are not available. The use of inequalities allows 
us to say that one estimate is better than another, that one maintenance policy is 
better than another or that a certain selection procedure is better than another 
etc., even though, we may not know the best estimator, the best maintenance 
policy or the best selection procedure. Such results are generally obtained from
inequalities between two probability measures or random variables. Inequalities
between random variables are in turn obtained from deterministic inequalities or 
deterministic partial orderings. 

Hardy, Littlewood and Polya (1952) in their classical book entitled 
Inequalities have discussed various partial orderings in $R^n$, one of which is known
as majorization. Majorization is intimately related to Schur functions. This 
partial ordering was used to derive the partial orderings of stochastic 
majorization and DT ordering among distributions in a series of papers by 
Proschan and Sethuraman (1977); Nevius, Proschan and Sethuraman (1977); 
Hollander, Proschan and Sethuraman (1977); and Hollander, Proschan and 
Sethuraman (1981). Even though many more partial orderings of this type have 
been studied in recent papers and books by Marshall and Olkin (1979), Tong 
(1980), Boland, Tong and Proschan (1987, 1988), Abouammoh, El-Neweihi and 
Proschan (1989), the above two partial orderings remain the centerpiece in this 
type of research endeavor. In this expository paper, we describe the essentials of 
stochastic majorization and DT ordering and demonstrate some applications. A
new proof of a slight generalization of an earlier result on DT functions in Hollander
et al. (1981) is given.


Introduction 


Researchers in applied fields have long recognized the usefulness of
inequalities when exact results are not available. The use of inequalities allows
us to say that one estimate is better than another, that one maintenance policy is
better than another or that a certain selection procedure is better than another,
etc., even though we may not know the best estimator, the best maintenance
policy or the best selection procedure. Such results are generally obtained from
inequalities between two probability measures or random variables. Inequalities
between random variables are in turn obtained from deterministic inequalities or
deterministic partial orderings.

Research supported by the United States Army Research Office, Durham, under Grant No.
DAAGLO3 86-K-0094. The United States Government is authorized to reproduce and
distribute reprints for governmental purposes. FSU Technical Report Number M-814; USARO
Technical Report Number D-109, September 1989.

Hardy, Littlewood and Polya (1952) in their classical book entitled 
Inequalities have discussed various partial orderings in $R^n$, one of which is known
as majorization. Majorization is intimately related to Schur functions. This 
partial ordering was used to derive the partial orderings of stochastic 
majorization and DT ordering among distributions in a series of papers by 
Proschan and Sethuraman (1977) [PS 77]; Nevius, Proschan and Sethuraman 
(1977) [NPS 77]; Hollander, Proschan and Sethuraman (1977) [HPS 77]; and 
Hollander, Proschan and Sethuraman (1981) [HPS 81]. Even though many more 
partial orderings of this type have been studied in recent papers and books by 
Marshall and Olkin (1979), Tong (1980), Boland, Tong and Proschan (1987, 
1988), Abouammoh, El-Neweihi and Proschan (1989), the above two partial 
orderings remain the centerpiece in this type of research endeavor. In this 
expository paper, we describe the essentials of stochastic majorization and DT 
ordering and demonstrate some applications in the second and third sections. A 
new proof of a slight generalization of earlier result on DT functions is given in 
the third section. 


Schur Functions 


We begin by reviewing some basic concepts and results involving Schur 
functions. Given a vector $x = (x_1, x_2, \ldots, x_n)$, let $x_{[1]}, x_{[2]}, \ldots, x_{[n]}$ be a permutation 
of its co-ordinates satisfying $x_{[1]} \ge x_{[2]} \ge \cdots \ge x_{[n]}$. A vector $x$ is said to 
majorize a vector $y$, $x \overset{m}{\ge} y$ in symbols, if 

$$\sum_{i=1}^{j} x_{[i]} \ge \sum_{i=1}^{j} y_{[i]}, \qquad j = 1, 2, \ldots, n-1,$$

and 

$$\sum_{i=1}^{n} x_{[i]} = \sum_{i=1}^{n} y_{[i]}.$$


Majorization is not a true partial ordering on $R^n$ since $x \overset{m}{\ge} y$ and $y \overset{m}{\ge} x$ 
implies only that the co-ordinate sequence of $x$ is a permutation of the co- 
ordinate sequence of $y$. However it is a partial ordering in the cone $\{x : x \in R^n, 
x_1 \ge x_2 \ge \cdots \ge x_n\}$. In any case, $x \overset{m}{\ge} y$ means that the co-ordinates of $x$ are 
more spread out than those of $y$. 
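
The definition above can be checked mechanically. The following Python sketch is only an illustration: the function name `majorizes` and the tolerance `tol` are assumptions introduced here, not part of the original paper. It sorts both vectors in decreasing order and compares the partial sums of the largest co-ordinates.

```python
from itertools import accumulate

def majorizes(x, y, tol=1e-12):
    """Return True if x majorizes y (illustrative helper, not from the paper)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    # Decreasing rearrangements x_[1] >= ... >= x_[n] and y_[1] >= ... >= y_[n]
    xs = sorted(x, reverse=True)
    ys = sorted(y, reverse=True)
    px = list(accumulate(xs))
    py = list(accumulate(ys))
    # Partial sums of the j largest co-ordinates, j = 1, ..., n-1
    if any(a < b - tol for a, b in zip(px[:-1], py[:-1])):
        return False
    # The totals must agree
    return abs(px[-1] - py[-1]) <= tol

print(majorizes([3, 0, 0], [1, 1, 1]))  # True
print(majorizes([1, 1, 1], [3, 0, 0]))  # False
# Permutations of one another majorize each other, so antisymmetry fails on R^n
print(majorizes([2, 1, 0], [0, 1, 2]) and majorizes([0, 1, 2], [2, 1, 0]))  # True
```

The last line illustrates why majorization is a partial ordering only on the cone of ordered vectors.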

A measurable function $f$ defined on $R^n$ will be called a Schur function if 
it is either Schur-convex, that is, if $f(x) \ge f(y)$ whenever $x \overset{m}{\ge} y$, or is 
Schur-concave, that is, if $f(x) \le f(y)$ whenever $x \overset{m}{\ge} y$. It is easy to construct 
Schur functions from the example below. 


Example 1 


Let $f(x) = \sum_{i=1}^{n} g(x_i)$. Then $f(x)$ is Schur convex if and only if $g$ is 
convex. 

A subset $A$ of $R^n$ is called Schur increasing if it satisfies: 

$$x \in A, \; y \overset{m}{\ge} x \implies y \in A.$$

Note that the indicator of a Schur increasing set is a Schur convex function and 
in fact such indicators are the building blocks of the class of Schur convex 
functions and act as their level sets. 

A partial ordering for random vectors can be defined as follows using 
Schur increasing sets. Let $X$ and $X'$ be random n-vectors. Then $X$ is said to 
stochastically majorize $X'$ if for every Schur increasing set $A$ in $R^n$, $P[X \in A] \ge 
P[X' \in A]$, or equivalently, $E[f(X)] \ge E[f(X')]$, for every bounded Schur convex 
function $f$ on $R^n$. This is stated, in symbols, as $X \overset{st.m}{\ge} X'$. 

Stochastic majorization is a way of comparing distributions of random 
vectors in much the same way as stochastic ordering is a way of comparing 
distribution functions of real random variables. In fact stochastic majorization 
can be equivalently defined as stochastic ordering between certain transformed 
random vectors. Recall that $Z$ is said to be stochastically larger than $Z'$ if for 
every bounded nondecreasing function $h$, $E[h(Z)] \ge E[h(Z')]$. Consider the 
transformation $y = (y_1, y_2, \ldots, y_n) \overset{def}{=} T(x)$, where $y_i = \sum_{j=1}^{i} x_{[j]}$, $i = 1, 2, \ldots, n$. It 
is clear that $C \overset{def}{=} T R^n$ is a cone. Let $X$ and $X'$ be two random vectors and let $Y 
= TX$ and $Y' = TX'$. Then it is easy to see that $X \overset{st.m}{\ge} X'$ if and only if $E[g(Y)] 
\ge E[g(Y')]$ for all bounded measurable functions $g$ such that $g(y) \ge g(y')$ 
whenever $y_i \ge y_i'$, $i = 1, 2, \ldots, n-1$ and $y_n = y_n'$, that is, if and only if $Y \overset{st}{\ge} Y'$ 
and $Y_n \overset{st}{=} Y_n'$. 


Oftentimes one shows that families of random variables are stochastically 
ordered by showing that they satisfy a stronger condition called TP$_2$, defined 
below. A function $\phi$ defined on $R^2$ is said to be totally positive of order 2 (TP$_2$) 
if it is nonnegative and satisfies 

$$\phi(\lambda_1, x_1)\, \phi(\lambda_2, x_2) \ge \phi(\lambda_1, x_2)\, \phi(\lambda_2, x_1),$$

whenever $\lambda_1 < \lambda_2$, $x_1 < x_2$. 

Let $\mu$ denote either the Lebesgue measure on $[0, \infty)$ or the counting 
measure on the set of non-negative integers. A function $\phi$ defined on $(0, \infty) \times 
[0, \infty)$ is said to possess a semigroup property in $\lambda$ if 

$$\phi(\lambda_1 + \lambda_2, x) = \int_{[0, x]} \phi(\lambda_1, x - y)\, \phi(\lambda_2, y)\, d\mu(y).$$
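
As a concrete illustration of the two properties just defined, the following Python sketch checks them numerically for the Poisson kernel $\phi(\lambda, x) = e^{-\lambda}\lambda^x / x!$ with the counting measure. The helper names (`poisson_pmf`, `is_tp2_on_grid`, `semigroup_gap`, `exp_kernel`) and the grids are assumptions made for this sketch; a finite grid check is evidence, not a proof.

```python
import math
from itertools import combinations

def poisson_pmf(lam, x):
    """Poisson kernel phi(lam, x) = exp(-lam) * lam**x / x!."""
    return math.exp(-lam) * lam**x / math.factorial(x)

def exp_kernel(lam, x):
    """A kernel that is NOT TP2 (it reverses the inequality); used as a contrast."""
    return math.exp(-lam * x)

def is_tp2_on_grid(phi, lams, xs, tol=1e-15):
    """Check phi(l1,x1)*phi(l2,x2) >= phi(l1,x2)*phi(l2,x1) for l1 < l2, x1 < x2."""
    for l1, l2 in combinations(sorted(lams), 2):
        for x1, x2 in combinations(sorted(xs), 2):
            if phi(l1, x1) * phi(l2, x2) < phi(l1, x2) * phi(l2, x1) - tol:
                return False
    return True

def semigroup_gap(phi, l1, l2, x):
    """|phi(l1+l2, x) - sum_{y=0}^{x} phi(l1, x-y) * phi(l2, y)| for counting measure."""
    conv = sum(phi(l1, x - y) * phi(l2, y) for y in range(x + 1))
    return abs(phi(l1 + l2, x) - conv)

lams = [0.5, 1.0, 2.0, 3.5]
xs = range(8)
print(is_tp2_on_grid(poisson_pmf, lams, xs))  # True
print(is_tp2_on_grid(exp_kernel, lams, xs))   # False
print(max(semigroup_gap(poisson_pmf, 0.7, 1.8, x) for x in xs) < 1e-12)  # True
```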


A class of theorems generally known as preservation theorems allows us 
to construct new Schur functions and understand their structure. The following 
is one of the first preservation theorems for Schur functions. We will see later 
that, by using the TP$_2$ and Schur properties with a variety of preservation 
theorems, several commonly used parametric families of distributions can be 
shown to possess interesting Schur properties. 


Theorem 1 


Let $f(x)$ be a Schur convex (Schur concave) function and let $\phi(\lambda, x)$ 
defined on $(0, \infty) \times [0, \infty)$ possess the TP$_2$ property and the semigroup property 
in $\lambda$. Let $\mu$ be the Lebesgue measure or the counting measure. Let the integral 

$$h(\lambda) = \int \prod_{i=1}^{n} \phi(\lambda_i, x_i)\, f(x)\, d\mu(x_1) \cdots d\mu(x_n)$$

be well defined. Then $h(\lambda)$ is Schur convex (Schur concave). 

This theorem appears as the main theorem in [PS 77]. In the principal 
application of this theorem, one takes $\phi$ to be a probability density function and 
shows that the operation of taking the expected value of a Schur convex function 
transfers the Schur convexity to the parameter vector. 


Theorem 2 
Let $X$ and $X'$ be a pair of random n-vectors and define $S = \sum_{i=1}^{n} X_i$ and $S' = 
\sum_{i=1}^{n} X_i'$. Then $X \overset{st.m}{\ge} X'$ if and only if (a) $S \overset{st}{=} S'$ and (b) for each bounded 
Schur convex function $f$, $E[f(X) \mid S = s] \ge E[f(X') \mid S' = s]$, for all $s \in A$, where 
the distribution of $S$ assigns probability one to $A$. 

This theorem is one of the important tools to be found in [NPS 77]. The 
notion of a Schur family extends the concept of stochastic majorization to a 
family of random variables. Let $X_\lambda$ be a family of random vectors with a 
distribution $P_\lambda$ indexed by $\lambda$ in $R^n$. The family $X_\lambda$ and the family $P_\lambda$ are said to 
be Schur families if $\lambda \overset{m}{\ge} \lambda'$ implies that $X_\lambda \overset{st.m}{\ge} X_{\lambda'}$. 

The following theorem shows that in Schur families, stochastic 
majorization is preserved among the posterior distributions when there is 
stochastic majorization among the prior distributions. 


Theorem 3 
Let $\{X_\lambda\}$ be a Schur family in $\lambda$. Let $G_1$ and $G_2$ be two prior 
distributions for $\lambda$, such that $G_1 \overset{st.m}{\ge} G_2$. Then the posterior of $X_\lambda$ under $G_1$ 
stochastically majorizes the posterior of $X_\lambda$ under $G_2$. 


Example 2. Shock Models. 


Consider a system subject to a series of shocks and assume that the 
different types of shocks arrive in a Poissonian fashion. For example, suppose 
that $X_i(t)$ denotes the number of shocks of the $i$-th type arriving in the interval 
$[0, t]$. Let $P(k)$, where $k = (k_1, k_2, \ldots, k_n)$, be the probability of the 
system surviving $k_i$ shocks of type $i$, $i = 1, 2, \ldots, n$. Suppose that for each $i$, 
the random variable $X_i(t)$ has a Poisson distribution with parameter $\lambda_i t$. Then it 
follows that the survival function of the system is given by 

$$H(t; \lambda) = E\{P(X_1(t), X_2(t), \ldots, X_n(t))\}.$$

Assume further that $P$ is Schur concave in $k$. This assumption holds, for 
example, if the effects of shocks are independent and $P$ is the product of n 
survival functions, each of which is logconcave. The TP$_2$ property of Poisson 
density functions and Theorem 1 show that the survival function $H(t; \lambda)$ is Schur 
concave in $\lambda$. For details see [PS 77]. 
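
The conclusion of Example 2 can be illustrated numerically. The sketch below assumes two shock types and a hypothetical component survival probability $\theta^{k(k+1)/2}$ (logconcave in $k$) shared by both types, so that $P(k)$ is a symmetric product of logconcave terms. The function names, $\theta = 0.8$, and the truncation point `kmax` are all assumptions made for illustration.

```python
import math

def poisson_pmf(mean, k):
    return math.exp(-mean) * mean**k / math.factorial(k)

def component_survival(k, theta=0.8):
    # Hypothetical logconcave survival probability: log equals k(k+1)/2 * log(theta)
    return theta ** (k * (k + 1) / 2)

def system_survival(t, lams, kmax=60):
    """H(t; lambda) = E[ prod_i component_survival(X_i(t)) ], X_i(t) ~ Poisson(lambda_i * t).

    Independence across shock types lets the expectation factor into a product.
    The Poisson sum is truncated at kmax terms (an approximation).
    """
    h = 1.0
    for lam in lams:
        h *= sum(poisson_pmf(lam * t, k) * component_survival(k) for k in range(kmax))
    return h

# (3.5, 0.5) majorizes (3, 1), which majorizes (2, 2); all have the same total rate
h_even = system_survival(1.0, (2.0, 2.0))
h_mid = system_survival(1.0, (3.0, 1.0))
h_spread = system_survival(1.0, (3.5, 0.5))
print(h_even >= h_mid >= h_spread)  # True
```

Schur concavity means that, for a fixed total shock rate, survival is most likely when the rates are spread evenly across the shock types.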


Example 3. Schur Function of Partial Sums. 


Let $X_{ij}$, $i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, k_i$, be independent identically 
distributed random variables with common logconcave density function $g$. Let $f$ 
be a Schur concave function and consider 

$$h(k) = E\, f\!\left(\sum_{j=1}^{k_1} X_{1j}, \sum_{j=1}^{k_2} X_{2j}, \ldots, \sum_{j=1}^{k_n} X_{nj}\right).$$

Due to a result of Karlin and Proschan (1960), the k-fold convolution 
$g^{(k)}(x)$ is TP$_2$ in $k$ and $x$. Using this and Theorem 1, it follows that $h(k)$ is Schur 
concave in $k$. 


Example 4. Schur Concavity of Moments. 


Let $g$ be a Schur concave density with the support $[0, \infty)^n$. Let $a_i$, $i = 
1, 2, \ldots, n$, be positive numbers and let 

$$\mu(a) = \int \cdots \int \prod_{i=1}^{n} \frac{x_i^{a_i - 1}}{\Gamma(a_i)}\, g(x)\, dx_1 \cdots dx_n$$

be a multivariate normalized moment. One can rewrite the integrand as 
$\prod_{i=1}^{n} \{x_i^{a_i - 1} e^{-x_i} / \Gamma(a_i)\}\; g(x) \exp\{\sum_i x_i\}$. Note that $g(x)\exp\{\sum_i x_i\}$ is Schur 
concave and that $\{x^{a-1} e^{-x} / \Gamma(a)\}$ is TP$_2$ in $(a, x)$ and is a semigroup on $(0, \infty)$. 
From Theorem 1 it follows that $\mu(a)$ is Schur concave. Note that there are 
examples where $\mu(a)$ is Schur convex if the normalizing constant $\Gamma(a)$ is omitted 
in the integrand. 


Example 5. Schur Families. 


A number of parametric families found in standard textbooks can be 
shown to be Schur families. To name a few: multinomial, multivariate negative 
binomial, multivariate hypergeometric, Dirichlet. Furthermore, families of 
independent random variables such as Poisson, Gamma etc. also form Schur 
families. A host of such examples are listed and demonstrated in [NPS 77]. 


Functions Decreasing in Transposition 


The partial ordering of majorization can sometimes be better understood 
by a standard partial ordering on the space of permutations of the set of n 
integers (1, 2,...,n). This leads to the concept of functions which are decreasing 
in transposition (DT) which extends the concept of Schur functions. 

Let $\pi = (\pi_1, \pi_2, \ldots, \pi_n)$ denote a permutation of (1, 2,...,n). Let $S$ 
denote the group of such permutations $\pi$. Suppose that $\pi$ and $\pi'$ differ only in 
two of their components, say the $i$-th and $j$-th, where $i < j$, $\pi_i < \pi_j$, and that $\pi_i' = 
\pi_j$, $\pi_j' = \pi_i$. We say that $\pi'$ is a simple transposition of $\pi$. If a member of $S$, say 
$\pi''$, is obtained from $\pi$ by successive simple transpositions, we say that $\pi$ 
dominates $\pi''$ in transposition and write $\pi \overset{t}{\ge} \pi''$. Clearly this relation 
establishes a partial ordering in $S$. 

Suppose that the components of $x$ are such that $x_1 \le x_2 \le \cdots \le x_n$. A 
permutation obtained by composing it with $\pi$ is denoted by $x \circ \pi$ and defined by 

$$x \circ \pi = (x_{\pi_1}, x_{\pi_2}, \ldots, x_{\pi_n}).$$

The partial ordering defined above can be extended in an obvious way to the 
vectors obtained by permuting components of $x$. 

In many applications one considers two vectors, the first vector 
corresponding to a parameter and the second vector to an observed random 
variable. It is useful to describe in a mathematical fashion the fact that a 
random vector and its parameter vector increase and decrease together. 
Oftentimes one needs to study and compare the way in which two random 
vectors vary together. For instance, one use of rank correlation is to measure 
how similarly two random vectors vary together. We will see below that the 
partial ordering on permutations, defined above, provides a satisfactory way to 
compare how similarly two vectors, which may be random or deterministic, vary 
together. 

Let $\Lambda$ and $\Xi$ be subsets of $R$. A function $g(\lambda, x)$ is said to be decreasing 
in transposition (DT) on $\Lambda^n \times \Xi^n$ if $g(\lambda \circ \pi, x \circ \pi) = g(\lambda, x)$, for every $\pi$ (that is, 
$g$ is invariant under the same permutation on the two vectors) and $g(\lambda, x \circ \pi) \ge 
g(\lambda, x \circ \pi')$, where $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$, $x_1 \le x_2 \le \cdots \le x_n$, and $\pi \overset{t}{\ge} \pi'$. 
When $g(x, y)$ is DT, the function $g(x, y)$ gives larger values when the 
ranking in the pair $(x, y)$ is more similarly ordered than when the ranking is less 
similarly ordered. 

In certain applications there is only one vector and it is desirable to 
define functions of a single vector which exhibit a monotonicity under this partial 
ordering. Let $h$ be defined on $\Xi^n$ and suppose that the components of $x$ are in 
increasing order. Then $h$ is said to be DT if $h(x \circ \pi) \ge h(x \circ \pi')$ whenever 
$\pi \overset{t}{\ge} \pi'$. 


DT functions occur quite frequently in statistics. The book of Marshall 
and Olkin (1979) has popularized the notion of DT functions under the more 
positive sounding name of Arrangement Increasing (AI) functions. The following 
theorem shows the relation between DT, Schur and TP$_2$ functions. 


Theorem 4 


(a) Suppose $g(\lambda, x) = h(\lambda - x)$. Then $g$ is DT on $R^{2n}$ if and only if $h$ is 
Schur concave. 


(b) Suppose $g(\lambda, x) = h(\lambda + x)$. Then $g$ is DT on $R^{2n}$ if and only if $h$ is 
Schur convex. 


(c) Suppose $g(\lambda, x) = \prod_{i=1}^{n} h(\lambda_i, x_i)$. Then $g$ is DT on $R^{2n}$ if and only if $h$ is 
TP$_2$. 
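
Part (c) can be illustrated by brute force for small n. The Python sketch below enumerates all permutations and their simple transpositions and checks the monotonicity half of the DT definition for a product of Poisson kernels (a TP$_2$ kernel), contrasting it with a product of kernels $e^{-\lambda x}$ that is not TP$_2$. All names (`compose`, `simple_transpositions`, `is_dt_on`) and the test vectors are assumptions of this sketch; the invariance condition $g(\lambda \circ \pi, x \circ \pi) = g(\lambda, x)$ holds automatically for products and is not tested.

```python
import math
from itertools import combinations, permutations

def poisson_pmf(lam, x):
    return math.exp(-lam) * lam**x / math.factorial(x)

def product_of(kernel):
    """Build g(lams, xs) = prod_i kernel(lams[i], xs[i])."""
    return lambda lams, xs: math.prod(kernel(l, x) for l, x in zip(lams, xs))

def compose(x, pi):
    """x o pi = (x_{pi_1}, ..., x_{pi_n}); pi uses 1-based entries as in the text."""
    return tuple(x[p - 1] for p in pi)

def simple_transpositions(pi):
    """Yield each pi' obtained by swapping positions i < j of pi when pi_i < pi_j."""
    for i, j in combinations(range(len(pi)), 2):
        if pi[i] < pi[j]:
            q = list(pi)
            q[i], q[j] = q[j], q[i]
            yield tuple(q)

def is_dt_on(g, lams, xs, tol=1e-15):
    """Check g(lams, x o pi) >= g(lams, x o pi') for increasing lams and xs."""
    for pi in permutations(range(1, len(xs) + 1)):
        for pi2 in simple_transpositions(pi):
            if g(lams, compose(xs, pi)) < g(lams, compose(xs, pi2)) - tol:
                return False
    return True

lams = (0.5, 1.0, 2.0)  # increasing parameter vector
xs = (0, 1, 3)          # increasing observation vector
print(is_dt_on(product_of(poisson_pmf), lams, xs))                      # True
print(is_dt_on(product_of(lambda l, x: math.exp(-l * x)), lams, xs))    # False
```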

The main result on DT functions is the following preservation theorem 
which states that the DT property is preserved under the operation of 
composition. 


Theorem 5 


Let $g_i$, $i = 1, 2$, be DT on $R^{2n}$ and $\sigma$ be a measure on $R^n$ such that for 
every Borel set $A$ in $R^n$, $\sigma(A) = \sigma(\pi \circ A)$ for every $\pi$. Suppose that 

$$g(x, z) = \int g_1(x, y)\, g_2(y, z)\, d\sigma(y)$$

is well defined. Then $g$ is DT on $R^{2n}$. 

The proofs of the above two theorems can be found in [HPS 77]. 
Theorem 1 can be derived as a consequence of Theorem 5 and Theorem 
4(b). Furthermore, the following result of Marshall and Olkin (1974) can also be 
obtained from Theorem 5 and Theorem 4(a). 


Theorem 6 


The convolution of two Schur concave functions is Schur concave. 

Most of the families considered in the second section can also be shown 
to have the DT property. In some sense this provides a better tool than Schur 
concavity because of the connections seen earlier. One of the interesting 
applications is the problem in ranking. Suppose the vector $X$ has density $\phi(\lambda, x)$ 
which is a DT function. Let $g(\lambda, r)$ be the probability that the rank vector of the 
X-observations is $r$. By using Theorem 5 above, it can be shown that $g$ is DT. 
This has important consequences in nonparametric statistics. For details, 
see [HPS 77]. 

It should be noted that the concept of Schur concavity is closely related 
to that of unimodality. From the above discussion it can be seen that a function 
defined on $R^2$ is Schur concave if and only if it is permutation invariant and its 
graph is such that it is unimodal on every section perpendicular to the line of 
equality. This definition can be extended to $R^n$ by considering all bivariate 
sections obtained by fixing (n-2) arguments and requiring Schur concavity for 
each section, in the sense just described. 

The convolution of two symmetric univariate unimodal densities can be 
shown to be a symmetric unimodal density. This is known as Wintner’s 
theorem. Using this result it follows that the convolution of two bivariate Schur 
concave densities is Schur concave. Again by considering sections, an alternative 
proof for Theorem 6 can be provided. 

The condition that the set $\{x : f(x) \ge c\}$ be convex and permutation 
invariant, for every $c > 0$, is sufficient for all the required sections of an n- 
variate density $f(x)$ to be symmetric unimodal. Many results that follow from 
such basic unimodality, and that are useful in deriving various properties of Schur 
concave functions, have been explored in a book by Joag-Dev and 
Dharmadhikari (1988). For instance, consider a random vector whose density 
function is logconcave. The logconcavity implies that the set where the density 
exceeds a given constant is a convex set and hence it satisfies the condition stated 
above. If the components of this random vector are also exchangeable, then the 
density function is Schur concave. 

An important theorem for multivariate logconcave densities, due to 
Prékopa (1973), is stated below. 


Theorem 7 


Let $Y = (Y_1, Y_2, \ldots, Y_n)$ have a logconcave density. Then $Z = (Z_1, 
Z_2, \ldots, Z_k) \overset{def}{=} (\sum_i a_{1i} Y_i, \sum_i a_{2i} Y_i, \ldots, \sum_i a_{ki} Y_i)$ also has a logconcave density. In 
particular all marginals have logconcave densities. 

We will use this theorem to derive Schur concavity and DT properties of 
densities of some random vectors obtained as overlapping sums of random 
variables. We begin with a simple case before going to the general case because 
the notation can get quite complicated. 


Theorem 8 
Let $X_{12}$, $X_{23}$, $X_{13}$ and $X_{123}$ be random variables such that $X_{12}$, $X_{23}$, $X_{13}$ 
are exchangeable. Define 

$$X_1^{(2)} = X_{12} + X_{13},$$
$$X_2^{(2)} = X_{12} + X_{23},$$
$$X_3^{(2)} = X_{13} + X_{23},$$
$$T_1 = X_1^{(2)} + X_{123},$$
$$T_2 = X_2^{(2)} + X_{123},$$
$$T_3 = X_3^{(2)} + X_{123}.$$

Then the density of $T \overset{def}{=} (T_1, T_2, T_3)$ is Schur concave under either one of the 
following conditions: 


(A) the joint density of $X_{12}$, $X_{13}$, $X_{23}$, $X_{123}$ is logconcave; 


(B) the random vector $(X_{12}, X_{13}, X_{23})$ has a logconcave density and is 
independent of the random variable $X_{123}$. 


Proof. Note that $T$ consists of overlapping sums of random variables. A more 
general case of overlapping sums will be considered later. 

From the definition of $T$ it is easy to see that it is exchangeable. The 
logconcavity of the density of $T$ follows readily from Prékopa's theorem 
(Theorem 7) under condition (A). This establishes the Schur concavity of the 
density of $T$ under (A). When condition (B) holds, Prékopa's theorem (Theorem 
7) once again shows that the density $f(x_1, x_2, x_3)$ of $(X_1^{(2)}, X_2^{(2)}, X_3^{(2)})$ is Schur 
concave. The density function of $T$ is given by 

$$\int f(x_1 - y, x_2 - y, x_3 - y)\, g(y)\, dy$$

where $g(y)$ is the density function of $X_{123}$. Since a positive mixture of Schur 
concave functions is Schur concave, it follows that the density of $T$ is Schur 
concave. □ 

We now generalize the above to random vectors in $R^n$. Let $J = \{1, 2, 
3, \ldots, n\}$. For $k = 2, \ldots, n$, let 

$$I_k = \{I : I \text{ is a subset of } J \text{ with cardinality } k\},$$

and 

$$I^* = \bigcup_{k=2}^{n} I_k \quad \text{and} \quad I_i = \{I \in I^* : i \in I\}.$$

Let $\{X_i, i = 1, 2, \ldots, n\}$ and $\{X_I, I \in I^*\}$ be a collection of random variables. Let 
$W(k) = \{X_I : I \in I_k\}$, $X_i^{(k)} = \sum_{I \in I_k \cap I_i} X_I$ and $X^{(k)} = (X_1^{(k)}, X_2^{(k)}, \ldots, X_n^{(k)})$, where 
$i = 1, 2, \ldots, n$ and $k = 2, 3, \ldots, n$. Thus $X_i^{(k)}$ is the sum of random variables, each 
having k subscripts, one of which is $i$. 

Theorem 9 


Let $X^{(1)} = (X_1, X_2, \ldots, X_n)$ be a random vector with probability density 
function which is DT. Suppose that the set $\{X_I, I \in I^*\}$ is independent of $X^{(1)}$ 
and one of the following conditions holds. 


(A) The set of all variables $\{X_I, I \in I^*\}$ is exchangeable and has a 
logconcave joint density function. 


(B) The collection of random variables in $W(k)$ has a logconcave density 
and is permutation invariant for $k = 2, 3, \ldots, n-1$, and the collections 
$W(2), W(3), \ldots, W(n)$ are independent. 

Then the joint distribution of $Z = (Z_1, Z_2, \ldots, Z_n)$ is DT, where 

$$Z_i = X_i + \sum_{k \ge 2} X_i^{(k)}.$$


Proof. The argument is similar to the proof of Theorem 8. Let $T_i = \sum_{k \ge 2} X_i^{(k)}$ 
and $T = (T_1, T_2, \ldots, T_n)$. 

The density function of $Z$ is the convolution of the density functions of $T$ 
and $X^{(1)}$, the second of which is DT by assumption. If we can show that the 
density function of $T$ is Schur concave, then it will follow that the density 
function of $Z$ is DT from Theorems 4(a) and 5. 

We will now show that condition (A) or (B) implies the Schur concavity 
of the density of $T$. 

When condition (A) holds it is easy to see that Prékopa's theorem 
implies that the joint density of $T$ is logconcave. The permutation invariance of 
this joint density follows from the exchangeability of $\{X_I, I \in I^*\}$. This 
establishes that the joint density function satisfies the DT property. 

When condition (B) holds, Prékopa's theorem once again shows that the 
density function of $X^{(k)}$ is logconcave for $k = 2, 3, \ldots, n-1$ and is permutation 
invariant. From the independence of $W(k)$, $k = 2, 3, \ldots, n-1$, it follows that the 
density function of $(T_1 - X_{\{1,\ldots,n\}}, \ldots, T_n - X_{\{1,\ldots,n\}})$ is logconcave and permutation 
invariant and hence Schur concave. Notice that $W(n) = \{X_{\{1,\ldots,n\}}\}$ consists of a 
single random variable. From the same argument given in case (B) of Theorem 
8, it follows that the density of $T$ is Schur concave. 
This completes the proof of Theorem 9. □ 

Theorem 9 generalizes Theorem 2.1 of [HPS 81] and contains a new 
proof. As an application of this theorem it can be shown that the density 
function of a generalized compound multivariate Poisson is DT. See [HPS 81] for 
details. 


OPTIMAL INTEGRATION OF SURVEYS 


P. K. Pathak, Department of Mathematics and 
Statistics, University of New Mexico, Albuquerque 


and 


M. Fahimi, Department of Mathematics and 
Statistics, University of New Mexico, Albuquerque 


The problem of integration of surveys is known to be of considerable 
practical as well as theoretical interest in the design of multi-purpose and 
continuing surveys. The object of this paper is to present a brief review of 
current developments in this area and to furnish a unified framework within 
which integration of surveys can be studied from various angles. 


Introduction 


The problem of integration of surveys, i.e., the problem of designing a 
sampling program for two or more surveys which maximizes the overlap between 
observed samples is known to be of considerable practical as well as theoretical 
interest in the design of multi-purpose and continuing surveys (Keyfitz, 1951). 
Development of cost-efficient sampling programs of this kind is a problem which 
agencies such as the National Sample Survey of India, Statistics Canada, the U.S. 
Bureau of the Census, the U.S.D.A., and others worldwide, have been continuing 
to tackle on an ad hoc basis. And although the literature on it is now over 40 
years old, basic research in it has reached a modest level of maturity only recently 
(cf. Arthanari and Dodge, 1981; Causey et al. 1985; Krishnamoorthy and Mitra, 
1987; Maczynski and Pathak, 1980; and others). Nevertheless despite these 
recent gains, there remains a pressing need for a unified framework within which 
integration of surveys and other similar problems of this nature, such as 
controlled selection and controlled rounding (Goodman and Kish, 1950, and 
Causey et al., 1985), can be studied with all their ramifications. In due course, 
such an approach is bound to provide a powerful guide to the cost-efficient design 
of survey programs commonly encountered in practice. The primary object of 
this paper is to briefly review the contemporary work in this area from the 
theoretical as well as computational viewpoints. 

In broad terms, integration of surveys can be referred to as the sampling 
program for two or more surveys. It has its origin in multipurpose surveys and 
sampling over successive occasions. In multipurpose surveys, the population 
characteristics under study are often subdivided into two or more groups of 


This research has been supported by the National Science Foundation Grants DMS-8703798 
positively correlated characteristics and different sampling schemes are employed 
for data collection for the different groups of characteristics. For example in 
multipurpose surveys, traditionally population size is made the basis for socio- 
economic surveys and geographical area for agricultural surveys. This gives rise 
to the multivariate problem of designing an overall sampling program which 
imbeds the different sampling schemes into a single multi-variate sampling 
scheme in a cost-effective manner. A problem of similar kind arises in sampling 
over successive occasions (Keyfitz, 1951) in which a given population is sampled 
by probabilities proportional to size over two or more successive time periods. 
Over time, sizes of population units change and this necessitates sampling the 
given population according to a new set of probabilities at each new time period 
in a cost-effective manner. In this case, it makes sense to design an overall joint 
sampling program which in some sense maximizes the overlap between samples 
over different occasions, or equivalently minimizes the number of distinct 
population units sampled over different occasions. In applications of this nature, 
population units to be sampled are typically primary sampling units (psu’s) and 
their selection represents a considerable financial investment. A new independent 
selection amounts to selecting an almost new set of psu’s each time and is not 
cost-effective. On the other hand, the use of the same psu’s on succeeding 
occasions, much like in the paired t-test, leads to significant reductions in errors 
of comparisons between periodic surveys (Kish and Scott, 1971). Thus in appli- 
cations of this nature, an integrated joint sampling program which minimizes the 
number of distinct psu’s selected over successive time periods is cost-effective and 
highly desirable. 

The problem of integration of surveys involving two with-replacement 
sampling schemes was originally formulated and solved by Keyfitz 
(1951). Lahiri (1954) proposed a serpentine arrangement of geographically 
contiguous psu's for optimal integration of two surveys. Raj (1957) studied the 
problem of integration of two surveys as a transportation problem and 
established the optimality of Lahiri's algorithm under a one-dimensional metric. 
Fellegi (1966) studied the problem of integration of two without-replacement 
sampling schemes and noted that even the simplest case of the sample size n = 2 
causes added complications. In the context of k (≥ 2) with replacement 
sampling schemes, Maczynski and Pathak (1980) presented a general solution in 
a closed form under certain assumptions. More recently Krishnamoorthy and 
Mitra (1986), and Mitra and Pathak (1984) have presented sequential algorithms 
for ‘optimal’ integration of two or three surveys in the context of with 
replacement sampling (cf. Arthanari and Dodge, 1981; Keyfitz, 1951; Lahiri, 
1954; Raj, 1957; Kish and Scott, 1971; and others). The recent upsurge of 
research in this area is both practically useful and theoretically interesting. 

Although the connection between integration of surveys and the 
transportation problem is well-known (Aragon and Pathak, 1990; Arthanari and 
Dodge, 1981; Causey et al., 1985; Maczynski and Pathak, 1980, p. 137; and Raj, 
1957), it is only in recent years that serious attempts have been made to solve 
the problem of optimal integration of surveys as a transportation problem. In 
general, integration of surveys is a transportation problem with an exponentially 
large number of variables, e.g. a simple problem of integration of two samples of 
size n = 5 each from a population of size N = 50 is equivalent to a transportation 
problem with approximately 4.5 × 10^12 variables. At the present time solvers for 
transportation problems of this size are unavailable in the public domain. And 
despite the unique sparse structure of the underlying tableaus of integration of 
surveys, there are very few results in the literature on size reduction techniques as 
an alternative to solving these very large transportation problems. Much remains 
to be done in the area of size reduction techniques for integration of surveys. 
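
The count quoted above can be reproduced with a few lines of Python. The sketch assumes that every subset of size n is an admissible sample on each survey, so that the tableau has one cell per k-tuple of samples; the function name `num_variables` is introduced here for illustration only.

```python
import math

def num_variables(N, n, k=2):
    """Cells in the transportation tableau for k surveys, each drawing a
    without-replacement sample of size n from N units (all subsets admissible)."""
    return math.comb(N, n) ** k

print(math.comb(50, 5))                  # 2118760 possible samples per survey
print(f"{num_variables(50, 5):.2e}")     # 4.49e+12
```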


Formulation of the Problem 


For clarity in the exposition, we adopt the following terminology: 


Z : the set of the first N natural numbers. We use the artifice 
of identifying the population under study by Z. 


k : number of surveys to be carried out, k ≥ 2. 


S : collection of all possible samples from which a sample is 
to be drawn for each of the k surveys; S being a subset of 
the power set of Z. 


$n_i$ : sample size for the $i$-th survey, 1 ≤ i ≤ k. 


$x_i$ : the outcome of the $i$-th survey. 


X : the joint outcome of k surveys, i.e., $X = (x_1, \ldots, x_k)$. 


$P_{ij}$ : the probability of selecting the $j$-th sample $s_j$ on the $i$-th 
survey, i.e., $P_{ij} = P(x_i = s_j)$, $x_i \in X$, $s_j \in S$, 1 ≤ i ≤ k. 


d : a cost function defined on $S^k$, i.e., it is non-negative and 
sub-additive on $S^k$. 


A survey or a sampling scheme on Z is a given, but otherwise quite 
arbitrary, collection S of samples from Z endowed with a given probability 
distribution P. Thus a survey is expressed by the pair (S, P) = {(s, P(s)): s€ S} 
in which P(s) denotes the probability of selection of the sample s. 

The problem of optimal integration of k surveys can now be stated as 
follows: 

Given k individual surveys, $(S, P_i)$, 1 ≤ i ≤ k, and a cost function d on 
$S^k$, find a joint probability distribution P for X on $S^k$ which for each $x_i$ realizes 
the preassigned marginal probabilities $P_i$ determined by the $i$-th survey, i.e., 
$P(x_i = s_j) = P_{ij}$, and at the same time minimizes the expectation of the cost 
function over the class of all surveys of this kind. 

Raj (1957) was perhaps the first to paraphrase the problem of integration 
of surveys as a transportation problem. In terms of our terminology, it is as 
follows: 


Problem 1 
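Given k surveys $\{(S, P_i) : 1 \le i \le k\}$ and a cost function d on $S^k$, 

$$\text{minimize} \quad \phi(X) = \sum_{X} P(X) \cdot d(X)$$

$$\text{subject to} \quad \sum_{X \in X_{ij}} P(X) = P_{ij}, \qquad X_{ij} = \{X \in S^k : x_i = s_j\},$$

$$P(X) \ge 0, \quad \forall\, X \in S^k.$$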


Example 1 


Consider a population of four psu’s and suppose that on two occasions a 
sample of size two (wor) is to be drawn from the population according to the 
sampling schemes given in Table 1. 


Table 1: Sampling Schemes for Example 1 


To maximize the expectation of the overlap between the two samples selected on 
the two occasions, the cost function d is taken to be the number of distinct psu's 
in the two samples, i.e., $d(s_m, s_n) = \#(s_m \cup s_n)$. The transportation problem 
representation of this problem is as follows: 


$$\text{minimize} \quad \sum_{m=1}^{6} \sum_{n=1}^{6} P(x_1 = s_m, x_2 = s_n)\, d(s_m, s_n)$$

$$\text{subject to} \quad \sum_{n=1}^{6} P(x_1 = s_m, x_2 = s_n) = P_{1m},$$

$$\sum_{m=1}^{6} P(x_1 = s_m, x_2 = s_n) = P_{2n},$$

$$P(x_1 = s_m, x_2 = s_n) \ge 0, \quad \forall\, m, n = 1, \ldots, 6.$$
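
The cost coefficients of this transportation problem follow directly from the definition of d. The Python sketch below builds the 6 × 6 cost matrix for samples of size two from four psu's and evaluates the objective for a given joint distribution. Because the marginals of Table 1 are not reproduced here, the joint distribution used in the usage lines (independent selection with uniform marginals) is a purely hypothetical baseline, and the names `samples`, `cost` and `expected_cost` are illustrative.

```python
from itertools import combinations

psus = [1, 2, 3, 4]
samples = list(combinations(psus, 2))   # the 6 possible samples of size 2

# d(s_m, s_n) = number of distinct psu's in the two samples
cost = [[len(set(a) | set(b)) for b in samples] for a in samples]

def expected_cost(joint, cost):
    """Objective of the transportation problem: sum of P(m, n) * d(m, n)."""
    size = len(cost)
    return sum(joint[m][n] * cost[m][n] for m in range(size) for n in range(size))

# Hypothetical baseline: independent draws with uniform marginals on both occasions
M = len(samples)
independent = [[1.0 / M**2] * M for _ in range(M)]

for row in cost:
    print(row)
print(expected_cost(independent, cost))  # about 3 distinct psu's on average
```

Each diagonal cost is 2, each pair of samples sharing one psu costs 3, and each disjoint pair costs 4, so any reduction of the objective below the independent baseline reflects probability mass moved toward overlapping samples.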


To solve this problem, one can use a standard linear programming package, e.g., 
the use of the SPLO-program (1981) yields the results summarized in Table 2. 


Table 2: A Solution to Example 1 
$P(x_1 = s_m, x_2 = s_n)$, m, n = 1,...,6 

[Table body (rows: Survey II samples; columns: Survey I samples) not recoverable from the source.] 

Based on this solution, we find that the minimum value of the objective function 
as defined in Problem 1 is 2.29. It is worth noting that this corresponds to the 
maximum expected overlap between the two sampling schemes: since each sample 
contains two psu's, the expected overlap is 4 − 2.29 = 1.71 psu's. 


Integration of Surveys with Replacement Sampling Schemes 


Unfortunately, most realistic problems of integration of surveys are not 
as tractable as the example in the preceding section seems to indicate. For 
example, it is easily seen that a straightforward 3-dimensional integration of 
surveys problem for a population of 20 psu's for without replacement samples of 
size n = 3 amounts to a transportation problem with over a billion variables. At 
the present time, hardware or software which can handle a problem of this 
magnitude is unavailable in the public domain. So it is not surprising at all that 
earlier attempts at solutions of these problems were largely directed towards 
finding closed form solutions under special circumstances. The first general result 
of this nature was obtained by Maczynski and Pathak (1980). Based on the 
following lemma, they provided closed form solutions for the special case of the 
sample size $n_i = 1$ and k surveys, k ≥ 2. 


Lemma 1 


Consider the general problem of integration of k surveys with $n_i = 1$ and 
suppose that there exists a joint probability distribution P on $S^k$ such that for 
each h, 1 ≤ h ≤ k and $1 \le i_1 < i_2 < \cdots < i_h \le k$, 

$$P(x_{i_1} = \cdots = x_{i_h} = j) = \min(P_{i_1 j}, \ldots, P_{i_h j}).$$


Then 


Moreover such a P minimizes the expected number of distinct psu’s selected in all 
the k samples. 


The case k = 2. An immediate corollary of Lemma 1 is that for k = 2 surveys 
with $n_i = 1$, the following closed-form solution is optimal: 

$$P(x_1 = h, x_2 = j) = f(h), \qquad h = j,$$
$$P(x_1 = h, x_2 = j) = f(h, j), \qquad h \ne j,$$

where 

$$f(h) = \min(P_{1h}, P_{2h}),$$
$$f(h, j) = \frac{(P_{1h} - f(h))\,(P_{2j} - f(j))}{1 - \sum_{h} f(h)}.$$


Algorithmic representations of the above solution have been provided by 
Keyfitz (1951) and Mitra and Pathak (1984). 
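
The closed form for k = 2 translates directly into code. The following Python sketch is an illustration under stated assumptions: the function name `integrate_two_surveys` and the two selection-probability vectors are hypothetical and are not taken from the papers cited. It builds the joint distribution, then confirms that both marginals are reproduced and that the expected number of distinct psu's equals 2 minus the total diagonal mass.

```python
def integrate_two_surveys(p1, p2):
    """Joint distribution for two single-unit surveys (n_i = 1) via the closed form:
    diagonal f(h) = min(P1h, P2h); off-diagonal
    f(h, j) = (P1h - f(h)) * (P2j - f(j)) / (1 - sum_h f(h))."""
    N = len(p1)
    f = [min(a, b) for a, b in zip(p1, p2)]
    rest = 1.0 - sum(f)                      # mass left after filling the diagonal
    joint = [[0.0] * N for _ in range(N)]
    for h in range(N):
        joint[h][h] = f[h]
        if rest > 0:                         # rest == 0 only when p1 and p2 coincide
            for j in range(N):
                if j != h:
                    joint[h][j] = (p1[h] - f[h]) * (p2[j] - f[j]) / rest
    return joint

# Hypothetical selection probabilities for four psu's on two occasions
p1 = [0.4, 0.3, 0.2, 0.1]
p2 = [0.1, 0.2, 0.3, 0.4]
joint = integrate_two_surveys(p1, p2)

row_sums = [sum(row) for row in joint]
col_sums = [sum(joint[h][j] for h in range(4)) for j in range(4)]
expected_distinct = sum(joint[h][j] * (1 if h == j else 2)
                        for h in range(4) for j in range(4))
print([round(r, 10) for r in row_sums])   # matches p1
print([round(c, 10) for c in col_sums])   # matches p2
print(round(expected_distinct, 10))       # 1.4, i.e. 2 - sum of min(P1h, P2h)
```

Independent selection with the same marginals would give an expected 2 − Σ P1h·P2h = 1.8 distinct psu's in this hypothetical case, so the integrated scheme saves 0.4 psu's on average.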


The case k = 3. The problem of optimal integration of three surveys with $n_i = 1$ 
is essentially solved now. In this case Lemma 1 forms the basis of all closed-form 
solutions which minimize the expected number of distinct sample units over the 
three occasions. In a series of fundamental papers Krishnamoorthy and Mitra 
(1986, 1987) have established the optimality of the Mitra-Pathak type 
algorithmic approach (1984) for the integration of three with replacement 
sampling schemes. For further details in this connection we refer the reader to 
the elegant work of Krishnamoorthy and Mitra (1987). For completeness, it 
would be worthwhile indeed to investigate extensions of these results to the 
general case of k surveys with k > 3. 


Size Reduction Techniques 


In this section we present a brief review of a technique which has the 
potential of significantly reducing the size of the induced transportation problem 
in the context of integration of two surveys. To illustrate this technique, consider 
the induced transportation problem of Example 1 and observe that the cost 
function d of this example is in fact a metric. This simple observation allows one 
to establish the following interesting result (Aragon and Pathak, 1990): 


Theorem 1 


Consider the problem of integration of two surveys of equal size in which
the cost function $d$ is induced by a metric. Then there is an optimal feasible
solution $P$ such that for each sample $s_m$,


\[ P(x_1 = s_m, x_2 = s_m) = \min(P_{1m}, P_{2m}). \]


The theorem implies that by setting these diagonal probabilities equal to
their largest admissible values, namely the minimum of the corresponding row
and column marginal probabilities, at least half of the restrictions of the problem
are satisfied. What is left then is the optimal determination of the remaining
unknown nondiagonal probabilities, i.e. $P(x_1 = s_m, x_2 = s_n)$, for only some of
$m \ne n$. This reduced problem is at most one-fourth the size of the original
problem. For example, application of this technique to Example 1 reduces the
original problem with 36 variables to a smaller problem with only 9 variables.
The reduced problem is stated below and an optimal solution is summarized in
Table 3.


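\[ \text{minimize} \quad \sum_{m=1}^{3} \sum_{n=1}^{3} P(x_1 = s_m, x_2 = s_n)\, d(s_m, s_n) \]

\[ \text{subject to} \quad \sum_{n=1}^{3} P(x_1 = s_m, x_2 = s_n) = P_{1m}, \]

\[ \sum_{m=1}^{3} P(x_1 = s_m, x_2 = s_n) = P_{2n}, \]

\[ P(x_1 = s_m, x_2 = s_n) \ge 0, \quad \forall\, m, n = 1, \dots, 3. \]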
Table 3: A Solution to the Reduced Version of Example 1
$P(x_1 = s_m, x_2 = s_n)$, $m, n = 1, \dots, 3$


Note that the marginal probabilities $P_{1m}$ and $P_{2n}$ are given by the marginal
entries in Table 3, and that Table 3 was obtained from Table 1 after the
assignment of the main diagonal probabilities $P(s, s)$. Thus at the first stage of
this size reduction technique, we set $P(s_1, s_1) = .04$, $P(s_1, s_2) = \dots = P(s_1, s_6) =
0$, $P(s_2, s_1) = 0$, $P(s_2, s_2) = .15$, ..., $P(s_6, s_6) = .14$. Then at the second stage, the
reduced problem is solved by using a standard transportation problem solver.
The two solutions, when combined, furnish a complete solution to the
original problem of Example 1.
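
The first stage of this technique is easy to sketch in Python. The function `fix_diagonal` and the marginal probabilities below are hypothetical (they are not the values of Example 1); the sketch only assumes that both surveys range over the same list of samples. Because each diagonal entry equals the smaller of its two marginals, a given index can survive as a residual row or as a residual column but never as both, which is why the reduced problem has at most one-fourth as many variables.

```python
def fix_diagonal(p1, p2):
    """First stage of the size reduction for a metric-induced cost.

    p1[m], p2[m]: probabilities of sample s_m under surveys 1 and 2.
    Returns (diag, rows_left, cols_left): the fixed diagonal values and the
    residual row / column marginals (index -> remaining mass) for stage two.
    (Hypothetical helper written for illustration only.)
    """
    diag = [min(a, b) for a, b in zip(p1, p2)]
    rows_left = {m: p1[m] - diag[m] for m in range(len(p1)) if p1[m] - diag[m] > 1e-12}
    cols_left = {m: p2[m] - diag[m] for m in range(len(p2)) if p2[m] - diag[m] > 1e-12}
    return diag, rows_left, cols_left


# Hypothetical marginals for six samples (NOT the values of Example 1).
p1 = [0.04, 0.20, 0.26, 0.10, 0.20, 0.20]
p2 = [0.04, 0.15, 0.30, 0.25, 0.12, 0.14]
diag, rows_left, cols_left = fix_diagonal(p1, p2)
reduced_size = len(rows_left) * len(cols_left)   # variables left for the solver
```

With these illustrative numbers the original 6 x 6 problem leaves three residual rows and two residual columns for the second stage.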

In order to present this size reduction technique in greater generality and
scope, a slight digression from the main theme of the paper is necessary. We
turn now to the following so-called Hitchcock transportation problem (Chvátal,
1983, p. 345):


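Problem 2

\[ \text{minimize} \quad \sum_i \sum_j c_{ij} x_{ij} \]

\[ \text{subject to} \quad \sum_j x_{ij} = p_i \quad (i = 1, \dots, m), \]

\[ \sum_i x_{ij} = q_j \quad (j = 1, \dots, n), \]

\[ x_{ij} \ge 0, \quad \forall\, i, j. \]


Moreover, suppose that in the preceding transportation problem, there are cells in
the cost matrix $C = \{c_{ij}\}$ with the following property of negative variation: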


Definition 1 


A cell $(i, j)$ of the cost matrix $C = \{c_{ij}\}$ is said to have negative variation
if for all $k \ne i$ and $l \ne j$, the following inequality holds:

\[ (c_{ij} + c_{kl}) - (c_{il} + c_{kj}) \le 0. \]


Similarly, the cell $(i, j)$ is said to have positive variation if the above difference is
always non-negative.
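
The definition translates directly into a brute-force check. The following Python sketch is ours (the helper names are hypothetical, indices are 0-based, and a non-strict inequality is assumed for negative variation); it lists the cells of a small cost matrix with entries $|j - i|$ that have negative variation.

```python
def variation(c, i, j, k, l):
    """The difference (c_ij + c_kl) - (c_il + c_kj), with 0-based indices."""
    return (c[i][j] + c[k][l]) - (c[i][l] + c[k][j])


def has_negative_variation(c, i, j):
    """True if the difference is <= 0 for every k != i and l != j."""
    m, n = len(c), len(c[0])
    return all(variation(c, i, j, k, l) <= 0
               for k in range(m) if k != i
               for l in range(n) if l != j)


def has_positive_variation(c, i, j):
    """True if the difference is >= 0 for every k != i and l != j."""
    m, n = len(c), len(c[0])
    return all(variation(c, i, j, k, l) >= 0
               for k in range(m) if k != i
               for l in range(n) if l != j)


# A 4 x 3 cost matrix with entries |j - i| (illustrative size).
D = [[abs(j - i) for j in range(3)] for i in range(4)]
neg_cells = [(i, j) for i in range(4) for j in range(3)
             if has_negative_variation(D, i, j)]
```

For this matrix the cells with negative variation are the three diagonal cells together with the last cell of the final row, while the top-right cell has positive variation.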

If the cost matrix of a given transportation problem has cells with
negative variation, then the given problem can be reduced to a new problem of a
smaller size. Specifically, if the original problem is of size $m \times n$ and has $c$ cells
with negative variation, then the reduced problem is of size $(m-a) \times (n-b)$ with
$a + b \ge c$. This size reduction is a consequence of the following theorem:


Theorem 2 


Suppose that a given cell, say (1, 1), of the cost matrix of Problem 2 has
negative variation. Then there exists an optimal feasible solution $X = \{x_{ij}\}$ with
$x_{11} = \min(p_1, q_1)$.

In a different guise, this theorem can be found hidden in the seminal 
work of the noted French mathematician Monsieur Monge (1781). In a totally 
different context in operations research, it has been used by A.J. Hoffman (1963). 
We independently discovered it in the context of integration of surveys and 
controlled selection. We take the liberty of referring to this theorem as the 
Monge-Hoffman size reduction theorem. 


Corollary 1 


If the objective of Problem 2 is maximization instead of minimization, 
then the above theorem goes through provided the cell (1, 1) has positive 
variation. 

Now consider the Transportation Problem 2 and assume that the cell
(1, 1) of its cost matrix has negative variation. Also, without loss of generality
assume that $p_1 < q_1$. Then the preceding theorem implies that the original
problem can be replaced by the following smaller problem:


\[ \text{minimize} \quad \sum_i \sum_j c_{ij} x_{ij}, \quad 2 \le i \le m,\ 1 \le j \le n \]

\[ \text{subject to} \quad \sum_j x_{ij} = p_i \quad (i = 2, \dots, m), \]

\[ \sum_i x_{ij} = q'_j \quad (j = 1, \dots, n), \]

\[ x_{ij} \ge 0, \quad \forall\, i, j, \]

where $q'_1 = q_1 - p_1$ and $q'_j = q_j$ for $j \ge 2$.


Clearly any optimal solution of this problem, along with $x_{11} = p_1$, $x_{12} =
\dots = x_{1n} = 0$, will provide an optimal solution to the original problem. This
effectively reduces the number of variables from $m \times n$ to $(m-1) \times n$. If instead
of $p_1 < q_1$ we have $q_1 < p_1$, a similar consideration will show that an optimal
solution now is given by $x_{11} = q_1$, $x_{21} = \dots = x_{m1} = 0$ and the values of the
remaining variables are obtained by solving an analogous reduced problem
involving $m \times (n-1)$ variables. Finally if $p_1 = q_1$, then an optimal solution is
given by $x_{11} = p_1 = q_1$, $x_{12} = \dots = x_{1n} = x_{21} = \dots = x_{m1} = 0$, and the values of
the remaining variables are obtained by solving another analogous reduced
problem of $(m-1) \times (n-1)$ variables.
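
The three cases can be collected into a single reduction step. The Python sketch below is our own illustration: the name `reduce_at_corner` and the sample marginals are hypothetical, and the cell (1, 1) is assumed to have negative variation so that Theorem 2 applies.

```python
def reduce_at_corner(p, q):
    """One size-reduction step at cell (1, 1).

    p, q: lists of row totals and column totals with equal sums.
    Returns (x11, p_reduced, q_reduced, dropped), where dropped records what
    leaves the problem: "row", "column", or "both".
    (Hypothetical helper; assumes cell (1, 1) has negative variation.)
    """
    x11 = min(p[0], q[0])
    if p[0] < q[0]:
        # Row 1 is exhausted; column 1 keeps the residual q1 - p1.
        return x11, p[1:], [q[0] - x11] + q[1:], "row"
    if q[0] < p[0]:
        # Column 1 is exhausted; row 1 keeps the residual p1 - q1.
        return x11, [p[0] - x11] + p[1:], q[1:], "column"
    # p1 = q1: both the first row and the first column are exhausted.
    return x11, p[1:], q[1:], "both"


# Hypothetical marginals: a 3 x 2 problem becomes a 2 x 2 problem.
x11, p_reduced, q_reduced, dropped = reduce_at_corner([0.2, 0.5, 0.3], [0.6, 0.4])
```

Applying the step repeatedly, whenever the current corner cell has negative variation, gives the sequential procedure described next.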

Note that if the cost matrix in the transportation problem has multiple
cells with negative variation then the preceding size reduction algorithm can be
carried out sequentially until all the cells with negative variation have been
removed. In the special case of $c_{ij} = |j - i|$, this size reduction procedure can be
carried out to the very end and an optimal solution can be obtained without ever
having to invoke any solvers of the transportation problem. This last result is a
consequence of the following theorem.


Theorem 3 


Consider the $m \times n$ matrix $D = \{d_{ij}\}$ in which $d_{ij} = |j - i|$, and suppose
that the first $r$ rows and the first $c$ columns of $D$ have been removed. Let $s =
\max(r, c)$ and $t = \min(m, n)$. Then the following holds:


a) All the cells on the shortest path joining the cells $(1, 1)$, $(s-r+1,
s-c+1)$, $(t-r, t-c)$ and $(m-r, n-c)$ have negative variation.


b) All the cells $(i, j)$ with $i > t-r$, $j < s-c+1$, and all cells $(k, l)$
with $k < s-r+1$, $l > t-c$ have positive variation.


Corollary 1 


The above theorem also holds if some of the very last rows and columns
of the matrix are removed as well. (This should be self-evident since the removal
of rows and columns from the end leaves the original structure of the matrix $D$
intact.)


Corollary 2 


Suppose that $d(x, y)$ has the following property of a distribution function
in two dimensions:


\[ d(x+h, y+k) - d(x, y+k) - d(x+h, y) + d(x, y) \ge 0 \]

for all $h, k > 0$. Then the northeast and the southwest cells of the matrix $D$
have positive variation, while the northwest and the southeast cells have negative
variation.

An immediate consequence of the above corollary is that for distance
functions such as $d(x, y) = xy$, the conventional northwest (greedy) algorithm
provides an optimum solution for the Hitchcock Transportation Problem 2.


Example 2 


The purpose of this example is to graphically illustrate the statement of 
Theorem 3. Consider a 9 x 7 matrix D for Theorem 3. Tables 4 through 6 
summarize the variations of D when certain initial rows and columns are 
removed from it. 


Table 4: Variations of the Matrix D
(0 rows and 0 columns are removed)


Table 5: Variations of the Matrix D
(3 rows and 2 columns are removed)


Table 6: Variations of the Matrix D
(2 rows and 3 columns are removed)


When the underlying cost coefficients $c_{ij}$ of Problem 2 are given by the
matrix $D$, the northeast algorithm (Algorithm I) provides a complete solution to
Problem 2. To determine the variations of the cells of an arbitrary matrix,
algorithms similar to the positive variation algorithm (Algorithm II) can be
used. Both of these algorithms are given at the end of this paper.


Example 3 


This example is taken from the paper by Causey, Cox, and Ernst (1985)
on the problem of maximizing the overlap between two surveys. The sampling
scheme is summarized in Table 7. The cost matrix $C = \{c_{ij}\}$, where $c_{ij} = \#(s_i
\cap s_j)$, along with its variations, is summarized in Table 8. The object is to:


\[ \text{maximize} \quad \sum_i \sum_j c_{ij} P_{ij} \]

\[ \text{subject to} \quad \sum_j P_{ij} = p_{1i}, \quad 1 \le i \le 12, \]

\[ \sum_i P_{ij} = p_{2j}, \quad 1 \le j \le 5, \]

\[ P_{ij} \ge 0, \quad \forall\, i, j. \]

It follows from Corollary 1 of Theorem 2 that there exists an optimal
solution for this problem such that for $1 \le i, j \le 5$, $P(s_i, s_j) = \min(p_{1i}, p_{2j})$ for
$i = j$ and zero otherwise. This partial solution reduces the size of the original
problem from 12 x 5 = 60 to 7 x 5 = 35 variables as summarized in Table 9.
The reduced problem can now be solved using any standard solver of
transportation problems.
Table 7: Sampling Scheme for Example 3
(columns: Survey I, Survey II)
Table 9: Sampling Scheme for the Reduced Problem
(columns: Survey II)


Example 3 


This example establishes the optimality of Lahiri’s selection scheme 
(1954). This scheme requires a serpentine ordering of the psu’s as illustrated in 
Figure 1 below so that geographically contiguous units occur next to each other 
in the sampling frame. 


Fig. 1. Lahiri’s Serpentine Ordering of PSU’s
Algorithm I (Northeast)


begin
  P[i,j] := 0;   { 1 <= i <= m, 1 <= j <= n }
  i := 1; j := n;
  while (j > 0) and (i <= m) do
    if p[i] < q[j] then
      begin
        P[i,j] := p[i];
        q[j] := q[j] - p[i];
        i := i + 1
      end
    else if p[i] > q[j] then
      begin
        P[i,j] := q[j];
        p[i] := p[i] - q[j];
        j := j - 1
      end
    else
      begin
        P[i,j] := q[j];
        i := i + 1;
        j := j - 1
      end
end.
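

For readers who prefer a runnable form, Algorithm I can be transcribed into Python roughly as follows. This is a sketch under our own assumptions: indices are 0-based, the loop stops as soon as either index leaves the matrix, and the function name `northeast` and the example marginals are hypothetical. Exact fractions are used in the example so that the test for equality of `p[i]` and `q[j]` behaves as in the pseudocode.

```python
from fractions import Fraction


def northeast(p, q):
    """Staircase allocation starting at cell (1, n) and moving toward (m, 1).

    p: row totals, q: column totals (equal sums).
    Returns an m x n list of lists whose rows sum to p and columns sum to q.
    """
    p, q = list(p), list(q)            # work on copies; the inputs stay intact
    m, n = len(p), len(q)
    P = [[0] * n for _ in range(m)]
    i, j = 0, n - 1                    # 0-based: first row, last column
    while i < m and j >= 0:
        if p[i] < q[j]:                # row i is exhausted first
            P[i][j] = p[i]
            q[j] -= p[i]
            i += 1
        elif p[i] > q[j]:              # column j is exhausted first
            P[i][j] = q[j]
            p[i] -= q[j]
            j -= 1
        else:                          # both are exhausted together
            P[i][j] = q[j]
            i += 1
            j -= 1
    return P


# Hypothetical marginals for a 3 x 2 problem.
p = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]
q = [Fraction(1, 2), Fraction(1, 2)]
P = northeast(p, q)
```

The sketch only reproduces the allocation rule; whether the resulting table is optimal depends on the variation structure of the cost matrix, as discussed in the text.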


Algorithm II (Positive Variation) 


begin
  for i := 1 to m do
    for j := 1 to n do
      begin
        positive := true;
        k := 1;
        repeat
          l := 1;
          repeat
            if variation(i,j,k,l) < 0 then
              positive := false;
            l := l + 1
          until (not positive) or (l > n);
          k := k + 1
        until (not positive) or (k > m)
      end
end.
THE MODEL BASED (PREDICTION) APPROACH TO FINITE
POPULATION SAMPLING THEORY 


Richard M. Royall, Department of Biostatistics, 
The Johns Hopkins University 


Introduction 


Estimating a finite population mean from a sample is equivalent to 
predicting the mean of the non-sample values. This view, that finite population 
inference problems are actually prediction problems, leads naturally to a theory 
in which prediction models, not sample selection probabilities, are central. This 
paper is an informal survey of that theory. 

The first section describes the model-based approach and attempts to 
make clear how and why it differs from the prevailing (randomization-based) 
theory. This section is built around a simple example, which is used to illustrate 
various facets of the approach. The second section addresses the question “What 
has the model-based approach accomplished?” This is not an attempt to catalog 
significant contributions to model-based sampling theory, but to describe and 
interpret the general kinds of developments that have occurred. Finally, the 
third section consists of some brief observations on current research. 


What Is Model-Based Sampling Theory? 


Model-based sampling theory begins by recognizing that problems of 
estimating finite population characteristics are naturally expressed as prediction 
problems (Kalbfleisch and Sprott, 1969; Geisser, 1986, p. 163). For example, 
Figure 1 shows the data for a sample of $n = 32$ hospitals. For each sample
hospital we know the number of beds ($x$) and we have observed the number of
patients discharged ($y$) during a given month. If we must estimate how many
patients were discharged from another hospital, say one with $x = 400$ beds, we
might fit the dotted line in Figure 1. The slope of that line, the ratio of total
sample discharges to total sample beds, shows that in sample hospitals there were
3.1 patients discharged per bed. Thus we might estimate that there were about
$3.1 \times 400 = 1240$ patients discharged from the other hospital. More generally,
to estimate how many patients were discharged from a set $r$ of non-sample hospitals
having a total of $\sum_r x_i$ beds, we might use $3.1 \sum_r x_i$. Then to estimate the
patient total for the entire population composed of the thirty-two hospitals in the
sample $s$ as well as those in $r$, we would simply add the observed total for the
thirty-two sample hospitals, $\sum_s y_i$, to our estimate for those not observed, $3.1 \sum_r x_i$.

Clearly this estimate of the population total is reasonable only if it is 
reasonable to assume that the hospitals in r are “like” the ones in s: if the sample 
hospitals are in the eastern United States while the r-hospitals are in France, then 
this estimate is certainly questionable. How can we formalize this reasoning, 
Figure 1. Number of patients discharged (vertical axis, PATIENTS) and number of
beds (horizontal axis, BEDS) in 32 short-stay U.S. hospitals, June 1968.
exposing and clarifying the underlying assumptions and explaining when the
estimate is a good one, and when it is not? 

A natural way to express the assumptions is through a probability model
for the numbers of patients discharged from each of the hospitals, both those in
the sample $s$ and those in $r$. The model represents these numbers, $y_1, y_2, \dots, y_N$,
as realized values of independent random variables $Y_1, Y_2, \dots, Y_N$, where $N$ is the
total number of hospitals.


Model M.  $E(Y_i) = \beta x_i$, $\mathrm{var}(Y_i) = \sigma^2 x_i$,
          $\mathrm{cov}(Y_i, Y_j) = 0$, $i \ne j$.


Under model M the $y$’s will tend to be roughly proportional to the $x$’s,
with more variability about the expected value, $\beta x$, in large hospitals than in
small ones. This model is consistent with the thirty-two observations shown in 
Figure 1, but is not unique in this respect. We must be alert to the possibility 
that other models might be more appropriate. Nevertheless, analysis under 
model M can explain much of what our informal look at the problem has already 
suggested. 

First we note that the model represents a link between the two sets of
numbers, $\{y_i;\ i \in s\}$ and $\{y_i;\ i \in r\}$, that enables us to learn about the second set
by studying the first. Now the problem of estimating $T = \sum_s y_i + \sum_r y_i$ is
evidently equivalent to the problem of predicting the value, $\sum_r y_i$, of the random
variable $\sum_r Y_i$. The estimate that we derived intuitively, $\hat{T}_R = \sum_s y_i + b \sum_r x_i$,
where $b = \sum_s y_i / \sum_s x_i = 3.1$, is the best linear unbiased (BLU) estimator of $T$ under
model M, because $b \sum_r x_i$ is the BLU predictor of $\sum_r Y_i$. Note that this is actually
the popular ratio estimator, $(\sum_s y_i / \sum_s x_i) \sum_1^N x_i$. The reason that we would not use
this estimate if the non-sample hospitals were in France is that we would be
unwilling to apply the same model (with the same value of the expected number
of patients per bed, $\beta$) to both the sample and non-sample facilities. Note that
this conclusion would apply even if we had decided at random which ones to 
exclude from the sample and had chosen the overseas hospitals by bad luck. Our 
reluctance to use the sample ratio, 3.1 discharges per bed, to estimate for those 
not in the sample arises from unwillingness to make the assumptions expressed in 
the model, not from the process used to choose which hospitals to put in the 
sample and which ones to leave out. 
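
The arithmetic of the ratio estimate is simple enough to sketch in Python. The function name `ratio_estimate` and the data below are hypothetical (they are not the 32 hospitals of Figure 1); the numbers were chosen so that the sample ratio again comes out at 3.1 discharges per bed.

```python
def ratio_estimate(x_sample, y_sample, x_nonsample):
    """Predict the population total T = sum_s y + sum_r y under model M.

    Returns (b, t_hat) with b = sum_s y / sum_s x and
    t_hat = sum_s y + b * sum_r x.
    (Hypothetical helper written for illustration only.)
    """
    b = sum(y_sample) / sum(x_sample)
    t_hat = sum(y_sample) + b * sum(x_nonsample)
    return b, t_hat


# Hypothetical data: three sampled hospitals, two non-sample hospitals.
x_s = [100, 200, 300]        # beds in sampled hospitals
y_s = [310, 640, 910]        # discharges observed in the sample
x_r = [400, 150]             # beds in non-sample hospitals
b, t_hat = ratio_estimate(x_s, y_s, x_r)
```

The result coincides with the sample ratio `b` multiplied by the total number of beds in the whole population, which is the ratio-estimator form of the same quantity.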

The model also provides guidance in sampling. For a given split of the
population into sample $s$ and non-sample $r$ hospitals, the estimation (prediction)
error in the ratio estimate of $T$ is $\hat{T}_R - T = b \sum_r x_i - \sum_r y_i$. Its expected value
under M is zero and its variance is $\mathrm{var}(\hat{T}_R - T) = (N/f)(1 - f)(\bar{x}\bar{x}_r/\bar{x}_s)\sigma^2$, where
$\bar{x}$ is the population mean, $\bar{x}_s$ and $\bar{x}_r$ are sample and non-sample means, $n$ is the
sample size, and $f$ is the sampling fraction, $n/N$. This variance decreases as $\bar{x}_s$
increases, so it is minimized when $s$ consists of the $n$ largest hospitals. Equally
important, it is maximized when $s$ consists of the $n$ smallest. Although we will
find that robustness considerations imply that it is often unwise to choose the
largest units for $s$, the smallest units represent the worst possible sample under a
wide variety of conditions.
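
The variance expression can be verified directly from model M. The derivation below is a sketch added for the reader; it uses only the assumptions that the $Y_i$ are uncorrelated with $\mathrm{var}(Y_i) = \sigma^2 x_i$, and it writes the error in terms of the random variables $Y_i$.

```latex
\begin{align*}
\hat{T}_R - T &= \frac{\sum_r x_i}{\sum_s x_i} \sum_s Y_i - \sum_r Y_i, \\
\mathrm{var}(\hat{T}_R - T)
  &= \left(\frac{\sum_r x_i}{\sum_s x_i}\right)^{2} \sigma^2 \sum_s x_i
     + \sigma^2 \sum_r x_i
   = \sigma^2 \, \frac{\sum_r x_i \left(\sum_r x_i + \sum_s x_i\right)}{\sum_s x_i} \\
  &= \sigma^2 \, \frac{(N - n)\,\bar{x}_r \cdot N\bar{x}}{n\,\bar{x}_s}
   = \frac{N}{f}\,(1 - f)\, \frac{\bar{x}\,\bar{x}_r}{\bar{x}_s}\, \sigma^2 .
\end{align*}
```

The second line uses the fact that the sample and non-sample sums involve disjoint sets of uncorrelated variables, so their variances add.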

Another role of the model is to validate large sample confidence intervals:
if the population is enlarged, so that both sets of hospitals, $s$ and $r$, grow in a
stable way, then $(\hat{T}_R - T)/[\mathrm{var}(\hat{T}_R - T)]^{1/2}$ converges in distribution to the
standard normal. Because $v = \sum_s (y_i - b x_i)^2 / (n x_i)$ is a consistent estimator of $\sigma^2$,
an approximate confidence interval for $T$ is given by $\hat{T}_R \pm
[(N/f)(1 - f)(\bar{x}\bar{x}_r/\bar{x}_s) v]^{1/2}$ when $n$ and $N - n$ are both large (Royall and
Cumberland, 1978).
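
A minimal Python sketch of this interval follows. The function name `ratio_interval`, the data, and in particular the multiplier `z` are our assumptions: `z` stands for whatever standard normal multiplier is wanted, and `z = 1` gives plus or minus one estimated standard error. The tiny data set only illustrates the arithmetic; the normal approximation itself requires large $n$ and $N - n$.

```python
import math


def ratio_interval(x_s, y_s, x_r, z=1.96):
    """Ratio estimate of T with a model-based standard error under model M.

    x_s, y_s: beds and discharges in the sample; x_r: beds in the non-sample.
    z: hypothetical standard normal multiplier (an assumption of this sketch).
    Returns (t_hat, se, (lower, upper)).
    """
    n = len(x_s)
    N = n + len(x_r)
    f = n / N                                   # sampling fraction
    b = sum(y_s) / sum(x_s)                     # sample ratio
    t_hat = sum(y_s) + b * sum(x_r)
    # v = sum_s (y_i - b x_i)^2 / (n x_i), the estimator of sigma^2
    v = sum((y - b * x) ** 2 / x for x, y in zip(x_s, y_s)) / n
    xbar_s = sum(x_s) / n
    xbar_r = sum(x_r) / (N - n)
    xbar = (sum(x_s) + sum(x_r)) / N
    se = math.sqrt((N / f) * (1 - f) * (xbar * xbar_r / xbar_s) * v)
    return t_hat, se, (t_hat - z * se, t_hat + z * se)


# Hypothetical data (not the hospitals of Figure 1).
t_hat, se, (lower, upper) = ratio_interval([100, 200, 300], [310, 640, 910], [400, 150])
```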

Although the ratio estimator is BLU under model M, other estimators 
might also be considered, because of robustness, simplicity, or other criteria. 
Analysis under M remains critical: the estimator we choose must at least have 
reasonable properties under this model if it is to be appropriate for estimating the 
total number of patients discharged. For example, the simple expansion 
estimator $\hat{T}_E = \sum_s y_i + (N - n)\bar{y}_s = N\bar{y}_s$, which estimates the non-sample mean $\bar{y}_r$
by the sample mean $\bar{y}_s$, would be inappropriate here in any sample $s$ of hospitals
whose mean size $\bar{x}_s$ is not very close to the population mean $\bar{x}$. This is because
the estimator is biased under M:


\[ E(\hat{T}_E - T) = N\beta(\bar{x}_s - \bar{x}). \]


This expression shows that the expansion estimator will tend to underestimate $T$
if the average size of sample hospitals, $\bar{x}_s$, is smaller than the population average,
$\bar{x}$, and to overestimate when $\bar{x}_s$ is larger. By contrast, the linear regression
estimator $\hat{T}_{LR} = N[\bar{y}_s + b_1(\bar{x} - \bar{x}_s)]$, where $b_1 = \sum_s (x_i - \bar{x}_s) y_i / \sum_s (x_i - \bar{x}_s)^2$, is,
like the ratio estimator, unbiased under M in any sample $s$: $E(\hat{T}_{LR} - T) = 0$.
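
The contrast between the two estimators can be made concrete without simulating any data, since under M the expected values depend only on the bed counts. In the Python sketch below the function name, the bed counts, and the value of $\beta$ are hypothetical.

```python
def expected_errors_under_M(x_sample, x_nonsample, beta):
    """Model-M expected errors of the expansion and ratio estimators.

    Uses only E(Y_i) = beta * x_i; no data are simulated.
    Returns (bias_expansion, bias_ratio).
    (Hypothetical helper written for illustration only.)
    """
    n = len(x_sample)
    N = n + len(x_nonsample)
    total_x = sum(x_sample) + sum(x_nonsample)
    expected_T = beta * total_x
    expected_expansion = N * beta * sum(x_sample) / n   # E(N * ybar_s)
    expected_ratio = beta * total_x                      # E(b) = beta under M
    return expected_expansion - expected_T, expected_ratio - expected_T


# Hypothetical sample of smaller-than-average hospitals.
bias_expansion, bias_ratio = expected_errors_under_M([100, 200, 300], [400, 150], 3.1)
```

Here the sample mean size is 200 beds against a population mean of 230, so the expansion estimator is expected to fall short of $T$, while the ratio estimator has no expected error.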

Thus we can evaluate estimators in terms of bias and variance under M, 
study how these properties are affected by characteristics of the sample, like $\bar{x}_s$,
and find approximating distributions for setting confidence intervals. If M were 
known to be true, then this body of theoretical results might be satisfactory for 
guiding us in selecting a sample and making inferences from observations. 

But M is not true. A sufficiently large sample of hospitals would surely 
reveal that M, like any mathematical model, is at best an approximation. 
Although we have adopted M as a working model for this population, we remain 
skeptical, aware that theoretical results derived under M have practical value 
only if they are robust in the face of plausible departures from this model. 

Robustness to departures from M can be studied by changing the model. 
For example if we generalize by relaxing the restriction that $\mathrm{var}(Y_i)$ be
proportional to $x_i$, we see that the ratio estimator remains unbiased and
consistent. But the large-sample confidence interval is no longer valid, because
the estimator of $\mathrm{var}(\hat{T}_R - T)$ is no longer consistent. Fortunately there are
variance estimators that are consistent under the generalized model, providing 
robust large sample confidence intervals (Royall and Cumberland, 1981a). 

To study the effects of errors in the working model’s regression function,
$E(Y_i) = \beta x_i$, we might consider a sequence of generalizations, first adding a
constant term, then a quadratic, etc. Each term added to the regression model
introduces bias in the ratio estimator. For example, if $E(Y_i) = \alpha + \beta x_i$ then the
bias is $E(\hat{T}_R - T) = N\alpha(\bar{x} - \bar{x}_s)/\bar{x}_s$. Protection against this bias can be achieved
by choosing a sample that is balanced on $x$: $\bar{x}_s = \bar{x}$. Protection against a
quadratic term’s bias can be achieved by balancing on $x^2$ as well: $\sum_s x_i^2/n =
\sum_1^N x_i^2/N$. And balancing on other powers of $x$ protects against bias caused by the
presence of corresponding terms in the true regression function (Royall and
Herson, 1973).
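
To see why balance removes these biases, it may help to write out the expected error of the ratio estimator under a quadratic regression function. The following is our own sketch, not taken from Royall and Herson; the coefficient $\gamma$ and the shorthand $\overline{x^2}_s = \sum_s x_i^2/n$, $\overline{x^2} = \sum_1^N x_i^2/N$ are notation we introduce for the purpose.

```latex
\begin{align*}
\text{If } E(Y_i) &= \alpha + \beta x_i + \gamma x_i^2, \text{ then} \\
E(\hat{T}_R) &= \frac{N\bar{x}}{n\bar{x}_s} \sum_s E(Y_i)
  = \frac{N\bar{x}}{\bar{x}_s}
    \left(\alpha + \beta\bar{x}_s + \gamma\,\overline{x^2}_s\right), \\
E(T) &= N\left(\alpha + \beta\bar{x} + \gamma\,\overline{x^2}\right), \\
E(\hat{T}_R - T) &= \frac{N\alpha\,(\bar{x} - \bar{x}_s)}{\bar{x}_s}
  + \frac{N\gamma\left(\bar{x}\,\overline{x^2}_s
      - \bar{x}_s\,\overline{x^2}\right)}{\bar{x}_s}.
\end{align*}
```

The first term vanishes when $\bar{x}_s = \bar{x}$, and the second vanishes when, in addition, $\overline{x^2}_s = \overline{x^2}$.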

Thus in order to protect against the bias that can be caused by departure 
from the working model’s regression function, we might choose a balanced sample 
in preference to the optimal (minimum variance) sample composed of the n 
largest hospitals. The same type of trade-off, efficiency for robustness, might 
apply to other aspects of the problem as well, such as the choice of an estimator 
for $T$ and an estimator of the error variance, $\mathrm{var}(\hat{T} - T)$. The model-based
theory does not assume that a particular model is correct and proceed blindly 
under that assumption: alternative models are used to examine the key practical 
issue of robustness. 

The main features of model-based sampling theory have appeared in our 
look at the hospital discharge population: 


(i) representing the unknown numbers of interest as realized values of 
observable random variables, 


(ii) recognizing that estimating a population value from an observed 
sample is a prediction problem, and 


(iii) using probability models as the formal basis for prediction and for 
determining the primary statistical properties of samples and 
predictors. 


The use of probability models as the basis for inference from sample to 
population, (iii), is the critical feature distinguishing the model-based theory from 
the prevailing one. Although a random sampling plan may be used for choosing 
which hospitals will be observed (and for which hospitals the number of 
discharges must be estimated), the basic inference framework is the probability 
model, not the random sampling plan. By contrast, the prevailing theory of 
finite population sampling reverses the priority, avoiding probability models in 
favor of distributions created by random sampling plans as the formal basis for 
inference. 

Conventional theory defines bias, for example, with respect to the 
probability distribution generated by the random sampling plan. Thus the 
expansion estimator, $N\bar{y}_s$, is an unbiased estimator of $T$ if every set of $n$ hospitals
is given the same probability of being selected as the sample. But the same
estimator is biased if the sample is chosen by another selection scheme. The bias
in $N\bar{y}_s$ is determined, not by relationships between the hospitals in $s$ and those
not in $s$, but by the probabilities with which other samples might have been
selected. Recall that the model-based theory under model M said that this 
estimator has a positive bias if the sample consists of hospitals that are larger, on 
average, than those not in the sample, a negative bias if the sample hospitals are 
smaller, and no bias only if the sample is balanced on size. Although both 
definitions of bias are mathematically valid, for the purpose of inference from a 
given sample of hospitals the model-based one is clearly relevant and informative 
while the conventional one is misleading. 

Conventional theory defines variance also as an average value over all 
possible samples. Again this is in contrast to model-based theory, which, because 
it defines the variance for a specific sample with respect to a prediction model for 
the unobserved variates, conditions on the characteristics of the sample actually 
observed as well as on those of the non-sample units whose values must be 
predicted. 

Model-based theory, by insisting that inferences should be based on 
prediction models, not on probability distributions created by randomly choosing 
which units to observe, does not preclude the use of random sampling plans. It is 
not the presence or absence, but the role, of random sampling that distinguishes 
model-based from conventional finite population sampling theory. The 
terminology invites misunderstanding on this point: because the word sampling 
in the name suggests only the design phase—choosing samples — model-based 
sampling theory is easily misinterpreted as signifying a theory for choosing 
samples using models, whereas the critical feature is the use of models in 
inference. 

There are other model-based approaches. The one sketched above is 
developed in terms of bias, variance, and approximate normality under linear 
models. Alternatives include approaches based on fiducial (Kalbfleisch and 
Sprott, 1969), likelihood (Royall, 1976b), and Bayesian prediction models. 
Ericson (1988) has recently surveyed the Bayesian theory. We will focus on the 
linear prediction approach, because it has seen the most vigorous development, 
empirical testing, and critical discussion. 


What Has the Model-Based Approach Accomplished? 


The model-based approach has bridged the gap between finite population 
problems and the rest of statistics. Before the model-based approach, finite 
population sampling was an eccentric realm where many of the basic concepts 
and tools of statistics were curiously inapplicable. Statisticians skilled in 
designing experiments and in applying linear models to make inferences from 
experimental and observational data found that finite population problems were 
apparently beyond the scope of their techniques. Although there were some 
familiar-looking formulas, such as the linear regression estimator shown in 
Section 1, these statistics lacked the familiar rationale and properties. Not only 
was the linear regression estimator biased (and therefore certainly not a BLU 
estimator) it was not even linear, because the random choice of observation 
points turned the denominator of the estimated slope into a random variable. To
make matters appear utterly hopeless to one interested in statistical theory,
Godambe (1955) proved that the BLU estimator for a finite population average
does not exist and furthermore (1966) showed that the likelihood function
generated by a random sample from a finite population is, for all practical
purposes, totally uninformative. Attempts to fill the theoretical vacuum were
uniformly unsuccessful (e.g., Godambe, 1966; Hanurav, 1968; Hartley and Rao, 
1969; Royall, 1969). 

The prediction approach revealed that the problem was rooted, not in 
esoteric aspects of finite population problems that invalidated the methods 
applicable to the rest of statistics, but in the attachment of those who worked in 
finite population sampling theory to a restrictive statistical doctrine based on a 
dubious principle. This is the Randomization Principle, proclaimed and then 
renounced by Fisher (1935 §21, 1960 §21.1), which asserts that the only 
probability distributions appropriate for statistical inference are those created by 
deliberate randomization. 

A particularly clear statement of the Randomization Principle in the 
finite population setting was given by Stuart (1962): 


If you feel at times that the statistician, in his insistence on 
random sampling methods, is merely talking himself into a job, 
you should chasten yourself with the reflection that in the absence 
of random sampling, the whole apparatus of inference from sample 
to population falls to the ground, leaving the sampler without a 
scientific basis for the inferences which he wishes to make. 


This Principle has had its champions in experimental statistics 
(Kempthorne, 1955), where it underlies the curious claim that no valid statistical 
inferences are possible in observational studies. (This last point is discussed in 
Royall (1976a), with references.) But in that area the Principle faced strong 
opposition, from “Student” (1937) and Neyman and Pearson (1937, p. 384) for 
example, and it never held sway. The Principle’s unchallenged domination of 
finite population theory is thus curious; it is doubly curious because this 
domination is credited to Neyman (1934) (ref. Smith, 1976; O’Muircheartaigh 
and Wong, 1981). 

The theoretical vacuum in finite population sampling was an inevitable 
consequence of the Randomization Principle. If the Principle is applied in other 
areas of statistics, entirely analogous results follow: if all inferences must be 
based on the probability distribution created by artificial randomization, so that 
all variables that have not been made random by the experimenter’s actions must 
be treated as fixed (possibly unknown) constants, then the likelihood function for 
randomized comparative experiments is just like the finite population likelihood 
function—uninformative (Cornfield, 1966). Likewise, if inferences about 
regression coefficients must be based on the distribution created by using 
deliberate randomization to select material for observation, then the 
Gauss-Markov theorem can justify least-squares estimators only in those cases
where at each value of the regressor $x$ the average response $\bar{y}$ over all units
actually available for observation falls precisely on the regression line: thus the
Principle would imply the non-existence of BLU estimators in essentially all
real-world applications, certainly including all problems where each potential sample
unit is characterized by a unique vector of regressor values.

Deliberate randomization is a valuable statistical tool (for protecting 
against unconscious bias, for example). Few statisticians would deny this. But 
the Randomization Principle claims much more: the only biases, standard errors, 
significance levels, and confidence coefficients acceptable for inference are those 
defined and justified in terms of deliberate randomization. The model-based 
prediction approach to finite population sampling consists of nothing more 
radical than taking the concepts, techniques, and tools that form the familiar 
core of applied statistics and using them where previously they had been 
precluded by acceptance of the Randomization Principle. This has had several 
important effects: 


(i) providing techniques for systematic study of some finite population 
sampling problems that the randomization approach is ill-equipped to 
address, 


(ii) bringing an alternative theoretical perspective to finite population 
methods that have been analyzed previously in terms of randomization 
theory, 


(iii) revitalizing conventional randomization-based finite population theory, 


(iv) providing a new context for studying the model-based methods that 
are standard outside of finite populations, and 


(v) testing general statistical concepts and principles in a new setting. 


Examples in the first category, problems that are difficult to address in
terms of deliberate randomization alone, include non-response (Särndal, 1981;
Little, 1982; Chiu and Sedransk, 1986), small area estimation (Laake, 1979; Holt, 
Smith, and Tomberlin, 1979; Royall, 1979), and inference from non-random 
samples (Smith, 1983; Kott, 1984). This is not to say that there was no 
methodology for these problems before the model-based approach came along. 
There were various techniques that had been derived intuitively and developed by 
trial and error. What models did was to provide a theoretical framework for 
studying the methods (such as synthetic estimates for small area estimation) and 
for describing the implicit assumptions behind them, as well as for suggesting 
alternatives. 

Of greater theoretical interest are activities of the second type — 
applications of the model-based approach to problems where the old 
randomization approach had already generated a body of results. In some cases
the prediction approach simply provided a new explanation and interpretation of
conclusions that had been reached by conventional sampling theory. An example
is the finding that the Yates-Grundy estimator is better than the Horvitz-Thompson
estimator for the variance of the mean-of-ratios statistic,
$N\bar{x}\sum_s (y_i/x_i)/n$, in samples chosen by a probability-proportional-to-$x$ sampling plan
(Cumberland and Royall, 1981).

In other cases the prediction approach revealed a clear preference for one 
of two procedures where the randomization approach had been noncommittal. 
One example is in post-stratification, where some followers of randomization 
theory had chosen to condition the variance on the actual stratum sample sizes, 
while others had chosen to use the unconditional variance. The deadlock was 
described by Holt and Smith (1979), whose prediction theory analysis made clear 
the need to condition. 

Variance estimation for the ratio estimator provides another example of 
the activities in category (ii). Randomization theory had been unable to choose 
between two proposed variance estimators, yet model-based analyses revealed 
that the more popular of the two has a severe conditional bias. This bias is 
positive in some samples, leading to overly conservative confidence intervals, and 
negative in others, producing undercoverage. The second statistic is free of these 
biases. It is worth noting that empirical comparisons of these two variance esti- 
mators had also been inconclusive, because the investigators, guided by 
randomization theory, had averaged the results over all of the values of the 
conditioning variable, and had thereby averaged out the biases (Rao and Rao, 
1971). Empirical studies guided by prediction theory exposed the biases clearly 
enough to inspire efforts to accommodate the conditional results within 
randomization theory (Fuller, 1981; Robinson, 1987). 

The model-based approach has stimulated conventional sampling theory 
in other ways as well. For example, model-based results on variance estimation 
(Royall and Cumberland, 1981a) have inspired significant developments in 
conventional theory (Wu and Deng, 1983; Deng and Wu, 1987). At a more 
general level, the model-based approach has forced those who object to it to 
examine and articulate the reasons for their opposition (e.g., Hansen, Madow, 
and Tepping, 1983) and to extend and adapt the conventional theory to accom- 
modate those model-based results that they find compelling (e.g., the above-cited 
attempts to develop a conditional randomization theory for the ratio estimator). 
Another general effect on conventional sampling theory has been to create a 
greater awareness of models and willingness to use them in analyses. Very 
important work has been done in studying the effects of using standard computer 
packages (i.e., analyses based on simple models) to analyze sample survey data 
when the models do not adequately describe the process generating the 
observations. Some of this work has been model-based and some has been based 
on random sampling distributions, but stimulated by the model-based activity, 
and using models in the analysis (e.g., Holt, Smith, and Winter, 1980; Skinner, 
Holmes, and Smith, 1986). 

Developments in category (iv) are of very general importance. The 
model-based approach brings new statistical methods to finite populations, 
methods that are widely used in other areas of statistics. These new applications 
represent important test cases for the methods, which are now used in real 
samples from real populations that can be examined in toto to determine exactly 
how large the estimator error is, whether the true mean actually lies within the 
confidence interval, etc. Studying statistical methods in finite populations entails 
a degree of realism and relevance to real-world phenomena that is hard to achieve 
in other contexts, where the object of estimation is an unobservable (usually pure- 
ly conceptual) model parameter, or where the test data are generated artificially. 

This is illustrated by the finite-population tests of the standard variance 
estimates in linear regression models (Royall and Cumberland, 1981a, b). These 
empirical studies showed that the estimates are much more sensitive to errors in 
the models’ variance structure than had been generally acknowledged (see e.g. 
Efron, 1979 §7). This suggests that more attention should be paid to bias-robust 
alternatives. But further finite population studies have produced frightening 
examples showing that confidence intervals based on bias-robust estimates, 
although better than those based on the standard variance estimates, can also 
perform very poorly under conditions that, though not uncommon, are difficult 
to recognize when they occur (Royall and Cumberland, 1985). 

Finally, the model-based approach to finite population sampling has also 
helped to clarify the basic concepts and principles of statistics. Stimulated by the 
good advice “Look at the data,” along with exciting computer capabilities for 
display and analysis of samples, statisticians now rely heavily on the data to 
suggest and criticize models. Finite population studies have helped to emphasize 
the limitations of this sort of empiricism: model failure that is not apparent in 
the sample can produce seriously misleading inferences (e.g., Royall and 
Cumberland, 1981a; Rubin, 1983). Thus, robustness is vitally important even 
when the model fits the observed data well. Other important general issues that 
have been emphasized and illustrated in the model-based approach to finite 
population sampling include the critical distinction between probabilistic and 
inferential validity and the need for conditioning on ancillary statistics to achieve 
the latter (see Royall, 1976a, for discussion; ref. also Hinkley, 1983), the 
inferential inadequacy of probability distributions generated by artificial 
randomization, and the fundamental importance of likelihoods (Royall, 1976b, 
discusses the last two points). 


Some Current Developments 


The role of randomization in a model-based approach to finite population 
sampling is a subject of continuing research. Randomization is certainly valuable 
at the sampling stage. For example, it can ensure that the chances are good that 
the sample selected will be well balanced, so that in that sample a given 
estimator is robust with respect to variables that are not adequately accounted 
for by the prediction model (Royall and Herson, 1973). But just when and how 
random sampling probabilities should influence inferences from a given sample 
has proved to be a difficult issue. On one hand, the set of labels identifying the 
sampled units is an ancillary statistic, so that the Conditionality Principle 
evidently precludes any role for the random sampling distribution in inference 
(Basu, 1971). On the other hand, the expected balance associated with simple 
random sampling is a characteristic whose statistical relevance does not seem to 
vanish entirely when the perspective shifts from (i) choosing which units to 
observe to (ii) making estimates from an observed sample (ref. Royall, 1976a, p. 
471). Thus there are continuing efforts to formalize and explain the precise role 
of random sampling in finite population inference (e.g., Sugden and Smith, 1984; 
Pfefferman and Holmes, 1985; Cumberland and Royall, 1988; Kott, 1988; and 
Tam, 1988) and to reconcile the prediction and randomization approaches 
(Brewer, Hanif, and Tam, 1988). 

But recent progress in model-based theory has not been limited to the 
interface with randomization theory. Tam (1986) has given an elegant extension 
and unification of earlier work on robust estimation. Chambers (1988) has con- 
tributed both theoretical and empirical results on model-based estimation for 
domains within a larger population. And Valliant has used the prediction 
approach to analyze the statistical properties of a widely-used method of variance 
estimation (1987a), to discover critical conditional properties of estimators in 
stratified samples (1987b), and to study an important problem in economic 
statistics (1988). 


SAMPLING THEORY USING EXPERIMENTAL DESIGN CONCEPTS 

Jaya Srivastava, Colorado State University 
and 
Zhao Ouyang, Colorado State University 


Abstract 


In this paper, we consider the application of concepts of Statistical 
Experimental Design to Sampling Theory. As is well-known, because of its 
inherent nature, Experimental Design Theory involves a relatively heavy amount 
of Combinatorial Mathematics. It turns out that, over the years, relatively 
speaking, it is this combinatorial aspect of Design, that has found much 
application in Sampling. We present a brief review of the same, including some 
of the latest work in the field. 


Introduction 


The subject of sampling using experimental design concepts has attracted 
more and more attention in recent years. A very explicit connection was made 
by M.C. Chakrabarti (1963) who indicated that balanced incomplete block 
designs (BIBD’s) could be used as sampling schemes. At first, it was shown that 
a BIBD procedure has properties similar to SRSWOR (simple random sampling 
without replacement). But later on it was found that a BIBD corresponds, in a 
sense, to controlled sampling, which was proposed by Goodman and Kish in 1950, 
and to which further contributions were made by Avadhani and Sukhatme (1965, 
1968, 1973). 

Consider an agricultural survey. Suppose we use SRSWOR to draw a 
sample of n counties from a population of N counties. It may happen that the n 
counties in our sample are spread out in an undesirable or inconvenient manner. 
As pointed out by Avadhani and Sukhatme (1973), “this may not only increase 
considerably the expenditure on travel, but the quality of data collected is also 
likely to be seriously affected by non-sampling errors, particularly non-response 
and investigator bias, since in such cases organizing close supervision over the 
field work would generally be fraught with administrative difficulties”. Such a 
sample is considered as non-preferred. Hence the total set of $\binom{N}{n}$ samples can be 
classified into two classes: preferred samples and non-preferred samples 
(Goodman and Kish, 1950). Hence, our objective is to design a sampling 
procedure which reduces the probability of drawing a non-preferred sample as 
much as possible, and at the same time resembles SRSWOR (assuming no 
stratification, clustering, etc. is present, and there are no auxiliary variables). 

The problem of controlled sampling was first proposed by Goodman and 
Kish (1950). This method involves stratified sampling and emphasizes the 
minimization of the probability of the selection of the non-preferred samples. 
But, as discussed by Avadhani and Sukhatme (1973), this method may lose 
precision in estimation. In their three papers (1965, 1968, 1973), Avadhani and 
Sukhatme discuss the problem of minimizing the chance of selection of non- 
preferred samples without losing efficiency relative to SRSWOR. 

We recall some useful notation from Srivastava (1985). Let U denote a 
population with N units denoted by the integers 1, 2,...,N. Let y be the variable 
of interest, and let $y_i$ (i = 1,...,N) be the value of y for the unit i in U. Let 
$Y = \sum_{i=1}^{N} y_i$ be the population total. The class of all subsets of U is 
denoted by $2^U$, and any $w \in 2^U$ is called a sample of U. (This includes the 
empty sample.) For any set K, let |K| denote the number of elements in K. For 
any $w \in 2^U$, let (w: n) be the class of all n-element subsets of w; if |w| < n, 
then this class is empty. A sampling measure, denoted by p(·), is a probability 
density {p(w)} defined on $2^U$. For a given p(·), let 


$\pi_i = \sum_{w:\, i \in w} p(w), \qquad 1 \le i \le N.$ (1) 


Then, $\pi_i$ (i = 1,...,N) is the probability that the unit i is included in the sample. 
For any non-empty sample w, let $\bar{y}_w$ denote the sample mean. Consider a 
sampling measure p for which all inclusion probabilities $\pi_i$ (i = 1,...,N) equal 
(n/N). Then, Avadhani and Sukhatme define p to be admissible if (i) $N\bar{y}_w$ is an 
unbiased estimator of Y, and (ii) $\mathrm{Var}_p(N\bar{y}_w) \le \mathrm{Var}_{SRS}(N\bar{y}_w)$, where 
$\mathrm{Var}_p$ and $\mathrm{Var}_{SRS}$ denote the variance respectively under the measure p, and 
the measure q induced by SRSWOR with sample size n. (Note that, for all 
$w \in 2^U$, $q(w) = 1/\binom{N}{n}$ if |w| = n, and q(w) = 0, otherwise.) 
Let 3 < n < N-3. The following results are given by Avadhani and 
Sukhatme (1973). 


Theorem 1 


Let $S \subset (U: n)$, and let |S| = b. Then the sampling measure which 
selects each $w \in S$ with probability (1/b) is admissible if and only if 
$|\{w: w \in S;\ i, j \in w\}|$ is the same for all $i \ne j$, i, j = 1,...,N. For such a 
measure, $|\{w: w \in S,\ i \in w\}|$ is the same for all i = 1,...,N. 

Under the condition of Theorem 1, let 


$\lambda = |\{w: w \in S;\ i, j \in w,\ i \ne j\}|,$ (2) 


$r = bn/N.$ (3) 

It is easy to see that the existence of S in Theorem 1 is equivalent to the 
existence of a BIBD with parameters (N, b, r, n, $\lambda$), such that N is the number of 
treatments, b the number of blocks, r the number of replications for each 
treatment, $\lambda$ the number of blocks which contain any given pair of treatments, 
and n the block size. In fact, such an S is a BIBD with the above parameters. 
But, when N and n are large, such a BIBD may be hard to identify. So, the next 
two theorems are useful. 


Theorem 2 


The measure induced by the following (two-part) sampling procedure is 
admissible: 


(i) Split the population randomly into k subpopulations with fixed sizes 
$N_i$ (i = 1,...,k) such that $\sum_{i=1}^{k} N_i = N$, 


(ii) For i = 1,...,k, select $n_i$ units from the ith subpopulation by using an 
admissible sampling measure (with inclusion probability ($n_i/N_i$)). The 
selection of the units from the different subpopulations should be done 
independently. 


Corollary 1 


The measure induced by the following procedure is admissible: 
(i) Draw a sample of size n’ > n from the population by SRSWOR. 


(ii) From the sample selected in (i), draw a sample of size n by using an 
admissible measure with inclusion probability n/n’ for each unit. 
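
The two-step procedure of Corollary 1 can be checked by brute force on a small case. The sketch below is only an illustration under assumptions that are not in the original: a population of N = 10 units, a first-stage size n' = 7, and the Fano plane (a BIBD with 7 points and block size 3) as the admissible second-stage measure; the names `two_step_measure` and `inclusion_probability` are hypothetical. It enumerates the induced measure exactly and compares its first and second order inclusion probabilities with those of SRSWOR of size 3.

```python
from fractions import Fraction
from itertools import combinations

# Fano plane: BIBD with 7 points, 7 blocks, r = 3, block size 3, lambda = 1.
# Points are positions 0..6 within the first-stage sample.
FANO = [(0, 1, 2), (0, 3, 4), (0, 5, 6), (1, 3, 5), (1, 4, 6), (2, 3, 6), (2, 4, 5)]


def two_step_measure(N, n_prime, blocks):
    """Exact sampling measure {p(w)} induced by the two-step procedure."""
    first_stage = list(combinations(range(1, N + 1), n_prime))
    weight = Fraction(1, len(first_stage) * len(blocks))
    p = {}
    for s in first_stage:        # step (i): SRSWOR of size n'
        for block in blocks:     # step (ii): one block, each with probability 1/b
            w = frozenset(s[pos] for pos in block)
            p[w] = p.get(w, 0) + weight
    return p


def inclusion_probability(p, units):
    """Probability that all the given units appear together in the sample."""
    units = frozenset(units)
    return sum(prob for w, prob in p.items() if units <= w)


N, n_prime, n = 10, 7, 3
p = two_step_measure(N, n_prime, FANO)
first_order = {inclusion_probability(p, [i]) for i in range(1, N + 1)}
second_order = {inclusion_probability(p, pair)
                for pair in combinations(range(1, N + 1), 2)}
print(first_order, second_order)
```

In this small case every unit has inclusion probability n/N and every pair has n(n-1)/(N(N-1)), the SRSWOR values, which is what admissibility of the induced measure requires of the first two orders.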


In view of the above, Avadhani and Sukhatme suggest that the following 
steps may be followed for controlled sampling: 


(i) Let $N_1 + N_2 + \cdots + N_g = N$. Divide the original population 
randomly into g subpopulations, which have sizes $N_1, N_2, \ldots, N_g$ 
respectively. 


(ii) Let $n_1 + n_2 + \cdots + n_g = n$. For i = 1, 2,...,g, select an integer $n'_i$ 
such that $n_i < n'_i < N_i$ and also select a BIBD with parameters ($n'_i$, 
$b_i$, $r_i$, $n_i$, $\lambda_i$). (It is preferred that $n'_i$ be much smaller than $N_i$.) Use 
SRSWOR to select (independently for each i) a sample of size $n'_i$ from 
the ith subpopulation of size $N_i$. 


(iii) For each sample of size $n'_i$ (i = 1,...,g) drawn in step (ii), collect the 
information on all the preferred subsamples of size $n_i$ and then find a 
BIBD with parameters ($n'_i$, $b_i$, $r_i$, $n_i$, $\lambda_i$) such that the number of the 
blocks which correspond to the preferred subsamples of size $n_i$ is as 
large as possible. Then draw one block with probability $1/b_i$ from the 
ith BIBD independently for i = 1,...,g. In this way, we get a sample 
of total size $n_1 + \cdots + n_g = n$. 


An example of controlled sampling using BIBD will be given in the last 
section in this paper. 


Other Works on Sampling Using Concepts of Experiment Design 


In the first section, we discussed the use of BIBD in controlled sampling. 
It is very clear that for a BIBD with parameters (N, b, r, n, $\lambda$), N corresponds to 
the number of units in the population, b corresponds to the (maximum possible) 
number of distinct samples, and n corresponds to the size of the sample. With 
this interpretation, it is easy to see that the parameters r and $\lambda$ in the BIBD 
correspond respectively to the first order and the second order inclusion 
probabilities. So, for some time, the use of BIBD in sampling has been discussed widely. 

As early as 1963, Chakrabarti pointed out the equivalence between 
SRSWOR and BIBD in the sense of having the same first order and second order 
inclusion probabilities. It is clear that the smaller the support of (i.e., the 
number of distinct blocks in) the BIBD, the better is the possibility of adapting it 
for a given situation of controlled sampling. Thus, BIBD’s with a small support 
size have importance in sampling theory. Because of this, the work of Hedayat 
and others in the field of BIBD’s with small supports is useful. 
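
The equivalence noted by Chakrabarti can be illustrated numerically. The following is a minimal sketch, assuming the Fano plane as the BIBD (N = 7, block size 3, support size 7) and an arbitrary vector of y-values chosen only for illustration; the helper names are hypothetical. It compares the first and second order inclusion probabilities, and the variance of N times the sample mean, with the corresponding quantities under SRSWOR, whose support has 35 samples.

```python
from fractions import Fraction
from itertools import combinations

# Fano plane: BIBD with N = 7 treatments, b = 7 blocks, r = 3, n = 3, lambda = 1.
FANO = [(0, 1, 2), (0, 3, 4), (0, 5, 6), (1, 3, 5), (1, 4, 6), (2, 3, 6), (2, 4, 5)]
N, n = 7, 3


def inclusion_probability(samples, units):
    """Inclusion probability when every listed sample is equally likely."""
    units = set(units)
    return Fraction(sum(1 for w in samples if units <= set(w)), len(samples))


def variance_of_expansion_estimator(samples, y):
    """Variance of N * (sample mean) when every listed sample is equally likely."""
    estimates = [Fraction(len(y) * sum(y[i] for i in w), len(w)) for w in samples]
    mean = sum(estimates) / len(estimates)
    return sum((e - mean) ** 2 for e in estimates) / len(estimates)


srswor = list(combinations(range(N), n))      # all 35 samples of size 3
y = [3, 1, 4, 1, 5, 9, 2]                     # arbitrary illustrative values

same_first = all(
    inclusion_probability(FANO, [i]) == inclusion_probability(srswor, [i])
    for i in range(N)
)
same_second = all(
    inclusion_probability(FANO, pair) == inclusion_probability(srswor, pair)
    for pair in combinations(range(N), 2)
)
var_bibd = variance_of_expansion_estimator(FANO, y)
var_srs = variance_of_expansion_estimator(srswor, y)
print(same_first, same_second, var_bibd == var_srs)
```

The variances agree because the variance of a linear estimator depends on the design only through the first and second order inclusion probabilities.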

In 1977, Wynn showed that for each sampling measure $p_1$, there is a 
measure $p_2$ which gives rise to the same first and second order inclusion 
probabilities as $p_1$, and whose support size is not greater than N(N - 1)/2. For the 
case of SRSWOR, he showed that no BIBD with support size less than N can be 
equivalent to SRSWOR in the above sense. Hence, with the help of BIBD’s we 
can reduce the support size from SRSWOR’s $\binom{N}{n}$ to something between $\binom{N}{2}$ and 
N. 

Besides BIBD, Fienberg and Tanur (1985) listed some parallel concepts 
in Design of Experiments and Sampling. These include randomization in design 
and random sampling, blocking in design and stratification in sampling, Latin 
square in design and lattice sampling, split-plot design and cluster sampling, and 
covariance adjustment in design and post-stratification in sampling. By using 
some similar parallel concepts in design and sampling, Meeden and Ghosh (1983) 
found some admissible strategies in sampling and Cheng and Li (1983) showed 
that Rao-Hartley-Cochran and Hansen-Hurwitz strategies are approximately 
minimax under some models. Brewer et al. (1977) discussed use of experimental 
design in the planning of sample surveys, and Sedransk (1967) discussed the use 
of experimental design in the analysis of sample surveys. But, even though 
experimental design and sampling have so many parallel concepts and similar 
structure, sampling has been developed separately from experimental design. 
Smith and Snyder (1985) pointed out the main distinction between experimental 
design and sampling from their nature of inference. They concluded that “the 
differences between survey and experiments are as important as the similarities, 
and that each will continue to develop in its own way”. An excellent discussion 
of experimental design and sample surveys, both with respect to their similarities 
and differences, was given by Fienberg and Tanur (1985). 

Hedayat (1979) gave a method for finding a sampling design which has 
the same first and second order inclusion probabilities as SRSWOR, but a smaller 
support size. (In other words, he gave a general method for obtaining 
BIBD’s with relatively small support sizes.) Let M denote the incidence matrix 
of all the pairs (i, j) versus all the samples of U with size n, where $i, j \in U$. 
Thus, M is a $\binom{N}{2} \times \binom{N}{n}$ zero-one matrix. Suppose all the samples of U with 
size n are arranged in a list in an arbitrary but fixed order. Consider a BIBD 
(with block size n) in which $f_k$ denotes the frequency of the kth sample in the 
above list. Let $f = (f_1, f_2, \ldots, f_{\binom{N}{n}})'$. Consider a sampling measure p which assigns 
probability $f_k / \sum_j f_j$ to the kth sample. Then, p has the same first and second 
order inclusion probabilities as SRSWOR of size n iff $Mf = \lambda 1$, where $\lambda$ is a 
positive integer and 1 is a column vector with all entries equal to 1. So each 
feasible solution of the system 


$Mf = \lambda 1, \quad f \ge 0$ (4) 


gives a sampling measure equivalent to SRSWOR of size n. Notice that there is 
always a solution for the system. So we can introduce another quantity, for 
example, the number of non-zero entries in f, and find a feasible solution of the 
system to minimize the quantity. Mathematical programming algorithms 
can be used to get such a solution. In other papers in combinatorics, Hedayat 
and others give further results. 
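
To make system (4) concrete, the sketch below builds the pair-by-sample incidence matrix M for a small assumed case (N = 7, n = 3) and checks two feasible frequency vectors: the all-ones vector, which corresponds to SRSWOR, and the indicator vector of the Fano plane blocks. The function names are hypothetical and the search for a minimum-support solution by mathematical programming is not attempted here; only feasibility is verified.

```python
from itertools import combinations


def pair_by_sample_incidence(N, n):
    """Zero-one matrix M: rows are pairs (i, j), columns are samples of size n."""
    pairs = list(combinations(range(N), 2))
    samples = list(combinations(range(N), n))
    M = [[1 if set(pair) <= set(s) else 0 for s in samples] for pair in pairs]
    return M, samples


def matvec(M, f):
    """Plain matrix-vector product M f."""
    return [sum(m * x for m, x in zip(row, f)) for row in M]


N, n = 7, 3
M, samples = pair_by_sample_incidence(N, n)

# Blocks of the Fano plane, written as sorted tuples so they match `samples`.
FANO = {(0, 1, 2), (0, 3, 4), (0, 5, 6), (1, 3, 5), (1, 4, 6), (2, 3, 6), (2, 4, 5)}

f_srswor = [1] * len(samples)                       # every sample once
f_fano = [1 if s in FANO else 0 for s in samples]   # only the 7 Fano blocks

print(set(matvec(M, f_srswor)), sum(1 for x in f_srswor if x))
print(set(matvec(M, f_fano)), sum(1 for x in f_fano if x))
```

Both vectors satisfy Mf = λ1 (with λ = 5 and λ = 1 respectively), but the second has support size 7 instead of 35.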

In Hedayat and Pesotan (1983), (R x L) triply balanced matrices were 
discussed. The (R x L) triply balanced matrices arise in estimating the mean 
square error of nonlinear estimators in sampling. Briefly, an (R x L) triply 
balanced matrix is $A = (\delta_{rh})$ with entries +1 or -1 such that 

$\sum_{r=1}^{R} \delta_{rh} = 0$, $\sum_{r=1}^{R} \delta_{rh}\delta_{rs} = 0$, $\sum_{r=1}^{R} \delta_{rh}\delta_{rs}\delta_{rt} = 0$, 

where h, s, t are distinct and h, s, t = 1,...,L. 

It was proved that an (R x L) triply balanced matrix A is an orthogonal array of 
strength 3 and 2 symbols. 
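
A small check may help fix the definition. The sketch below assumes the three conditions are the vanishing of all column sums, all sums of products of two distinct columns, and all sums of products of three distinct columns; the function name and both example matrices are illustrative choices, not taken from Hedayat and Pesotan. The full 2^3 factorial passes, while its half fraction (an orthogonal array of strength 2 only) fails the third condition.

```python
from itertools import combinations, product


def is_triply_balanced(A):
    """Check the three balance conditions for a matrix with entries +1 or -1."""
    cols = range(len(A[0]))
    single = all(sum(row[h] for row in A) == 0 for h in cols)
    double = all(sum(row[h] * row[s] for row in A) == 0
                 for h, s in combinations(cols, 2))
    triple = all(sum(row[h] * row[s] * row[t] for row in A) == 0
                 for h, s, t in combinations(cols, 3))
    return single and double and triple


# Full 2^3 factorial in +1/-1 coding: 8 rows, 3 columns.
full_factorial = [list(row) for row in product((1, -1), repeat=3)]

# Half fraction with defining relation abc = +1: 4 rows, 3 columns.
half_fraction = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]

print(is_triply_balanced(full_factorial), is_triply_balanced(half_fraction))
```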

In Hedayat, Rao, and Stufken (1988), balanced sampling plans excluding 
contiguous units are discussed. In some situations, the N units of the population 
are arranged in a natural order. In this case it may happen that contiguous units 
provide similar information, so that it seems more reasonable to select a 
sampling plan such that contiguous units cannot both appear in the sample. Here 
the term balanced means that the first and second order inclusion probabilities 
are fixed. The condition of the existence of such a sampling measure is given in 
this paper, and a method of constructing such a sampling measure is also 
proposed. 


Use of t-Design 


Motivated by the usefulness of BIBD’s in sampling, Srivastava and Saleh (1985) 
proposed the use of t-designs in sampling. A BIBD, which has the 
same inclusion probabilities (of individual units, and pairs of units) as SRSWOR, 
has the same moments as SRSWOR up to order two. Generalizing this, 
Srivastava and Saleh showed that a t-design has the same moments as SRSWOR 
up to order t, because it has the same inclusion probabilities as SRSWOR up to 
order t (i.e. every set of i units (i = 1,...,t) has the same inclusion probability, 
say $q_i$). Also, as for the BIBD, the sample space under a t-design can be much 
smaller than the sample space under SRSWOR. Thus, using t-designs we can try 
to avoid non-preferred samples, and still maintain resemblance to SRSWOR up 
to moments of order t. 

For later use, define $a_{iw}$ ($i \in U$, $w \in 2^U$) by 


$a_{iw} = 1$, if $i \in w$; $a_{iw} = 0$, otherwise. (5) 

Let $1 \le k \le N$. For any sampling measure {p(w): $w \in 2^U$} define 

$\pi(i_1, \ldots, i_k) = \sum_{w} p(w)\, a_{i_1 w} a_{i_2 w} \cdots a_{i_k w},$ (6) 

where $i_1, i_2, \ldots, i_k \in U$. 

In this section, we suppose the sample size is always equal to n, a fixed 
integer. We are interested in estimating the population total Y. 

The following results from Srivastava and Saleh (1985) are useful in the 
studies on using t-design theory in sampling. 


Lemma 1 


Let $2 \le k \le n$. Suppose $i_1, \ldots, i_{k-1}$ are distinct elements of U. Then we 
have 


$\sum_{i=1,\ i \ne i_1, \ldots, i_{k-1}}^{N} \pi(i_1, \ldots, i_{k-1}, i) = (n - k + 1)\, \pi(i_1, \ldots, i_{k-1}),$ (7) 


$\sum_{i=1}^{N} \pi(i) = n.$ (8) 


This lemma says that for $2 \le k \le n$, the inclusion probabilities of 
order j ($1 \le j \le k-1$) are determined by the inclusion probabilities of order k. 
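
Lemma 1 is a counting identity for fixed-size designs, and it can be confirmed on a small example. The sketch below assumes a design that puts equal probability on the seven blocks of the Fano plane (N = 7, n = 3); the functions `pi` and `lemma1_holds` are hypothetical helpers. It checks that summing the order-k inclusion probabilities over the added unit gives (n - k + 1) times the order-(k-1) probability, for k = 2 and k = 3, and that the first order probabilities sum to n.

```python
from fractions import Fraction
from itertools import combinations

# Equal-probability design on the 7 blocks of the Fano plane (fixed size n = 3).
DESIGN = [(0, 1, 2), (0, 3, 4), (0, 5, 6), (1, 3, 5), (1, 4, 6), (2, 3, 6), (2, 4, 5)]
N, n = 7, 3


def pi(design, units):
    """Joint inclusion probability of `units` under an equal-probability design."""
    units = set(units)
    return Fraction(sum(1 for w in design if units <= set(w)), len(design))


def lemma1_holds(design, N, n, k):
    """Check the order-k to order-(k-1) relation for every (k-1)-subset of units."""
    for fixed in combinations(range(N), k - 1):
        lhs = sum(pi(design, fixed + (i,)) for i in range(N) if i not in fixed)
        if lhs != (n - k + 1) * pi(design, fixed):
            return False
    return True


first_order_total = sum(pi(DESIGN, (i,)) for i in range(N))
print(lemma1_holds(DESIGN, N, n, 2), lemma1_holds(DESIGN, N, n, 3), first_order_total)
```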


Theorem 3 


Suppose there are two different sampling measures on $2^U$. Let t be a 
positive integer. Then these two sampling measures give the same inclusion 
probabilities of order t if and only if these two sampling measures give the same 
values of $E(\bar{y}_w^k)$, k = 1,..., t, for all possible values of $(y_1, \ldots, y_N)$. 


Let 
p(w) = dX (Ii Ju)? (9) 
2 =H E U-I = H ho). (10) 
Ew 


Then, we have 


Theorem 4 


Consider two sampling measures on $2^U$. Consider the following four 
conditions: 


(i) For all possible values of y = $(y_1, \ldots, y_N)'$, E($\bar{y}_w$) is the same under 
these two sampling measures, 


(ii) For all possible values of y, E($\bar{y}_w^2$), or E($s_w^2$), or V($\bar{y}_w$) is the same 
under these two sampling measures, 


(iii) For all possible values of y, cov($\bar{y}_w$, $s_w^2$) is the same under these two 
sampling measures, 


(iv) For all possible values of y, V($s_w^2$) is the same under these two 
sampling measures. 


Let t be an integer such that $1 \le t \le 4$. Then the above conditions (i), (ii), up 
to (t) are true if and only if these two sampling measures have the same inclusion 
probabilities of order t. 

One can generalize Theorem 4 to higher order. But the most important 
case is order 4. In this case, we can characterize the mean and the variance of a 
linear estimator, and characterize the variance of a quadratic estimator of the 
variance of the linear estimator. 

Now consider a t-design D(N, n, t, b) where N is the number of varieties, 
n the block size, b the number of blocks (which may or may not be distinct), and 
where every combination of t varieties ($t \le n$) occurs in $b\binom{n}{t}/\binom{N}{t}$ blocks. 
Consider a sampling measure (called a t-design sampling measure) which selects 
each block of D(N, n, t, b) with probability 1/b. When $b = \binom{N}{n}$ and each block 
in D(N, n, t, $\binom{N}{n}$) is distinct, this sampling measure becomes SRSWOR. In this 
case SRSWOR is a t-design D(N, n, t, $\binom{N}{n}$) where t can take any value from 1 to 
n. 

For the t-design sampling measure mentioned above, for distinct $i_1, \ldots, i_t \in U$, 
we have 

$\pi(i_1, \ldots, i_t) = \left[ b\binom{n}{t} \Big/ \binom{N}{t} \right] \Big/ b = \binom{n}{t} \Big/ \binom{N}{t}.$ (11) 


Hence we have the following theorem. 


Theorem 5 


SRSWOR (with sample size n) and the t-design sampling measure have 
the same inclusion probabilities of order t and hence have the same moments up 
to order t. 


For any t-design D(N, n, t, b), the number of distinct blocks is not 
greater than $\binom{N}{n}$ and usually is much less than $\binom{N}{n}$. This makes a t-design useful 
in controlled sampling. In fact, a BIBD is a 2-design. Because we need to 
estimate V($\bar{y}_w$), we need to consider up to the fourth moments; the first two 
moments are not enough. In view of this, Srivastava and Saleh assert that it 
would be much better to use 4-designs rather than BIBD’s, since the former give 
rise to the same moments as SRSWOR up to order 4. 
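
The point that a 2-design matches SRSWOR only up to order two can be seen directly on a small case. The sketch below, which assumes the Fano plane as the 2-design and uses a hypothetical helper `matches_srswor_at_order`, compares order-t inclusion probabilities with the SRSWOR value $\binom{n}{t}/\binom{N}{t}$ for t = 1, 2, 3.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

# Fano plane: a 2-design (BIBD) with N = 7 varieties and block size n = 3.
FANO = [(0, 1, 2), (0, 3, 4), (0, 5, 6), (1, 3, 5), (1, 4, 6), (2, 3, 6), (2, 4, 5)]
N, n = 7, 3


def matches_srswor_at_order(blocks, N, n, t):
    """True if every t-subset has inclusion probability C(n, t) / C(N, t)."""
    target = Fraction(comb(n, t), comb(N, t))
    b = len(blocks)
    return all(
        Fraction(sum(1 for blk in blocks if set(c) <= set(blk)), b) == target
        for c in combinations(range(N), t)
    )


all_samples = list(combinations(range(N), n))   # SRSWOR itself, as a design

fano_orders = [matches_srswor_at_order(FANO, N, n, t) for t in (1, 2, 3)]
srswor_orders = [matches_srswor_at_order(all_samples, N, n, t) for t in (1, 2, 3)]
print(fano_orders, srswor_orders)
```

The Fano plane agrees with SRSWOR at orders 1 and 2 but not at order 3, since only 7 of the 35 triples can ever be drawn; quantities that depend on third or fourth order inclusion probabilities may therefore differ.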


Connection with Arrays 


The theory of factorial designs constitutes a major part of the whole 
subject of experimental design. Furthermore, the modern theory of factorial 
designs is largely built around the concept of arrays. Indeed, arrays constitute a 
very important tool in all of design theory, since for example, BIBD’s, PBIBD’s 
and t-designs, etc. may (through their incidence matrices) be studied in terms of 
arrays. Because of this, in this section, we discuss the application of arrays in 
sampling theory. An array is a matrix whose elements come from a finite set. 
Suppose the finite set has m elements in it. Without loss of generality, we use 
the integers 0, 1,...,m-1 to denote the elements of the finite set. In this case, an 
array is a matrix whose elements belong to the set {0, 1,...,m-1}. When m = 2, 
such an array becomes a (0, 1) matrix, which is of special importance. 

A special case of a (0, 1) matrix is the incidence matrix of a class of 
subsets of a given finite set. The rows of an incidence matrix correspond to the 
elements of the given finite set and the columns correspond to the subsets of the 
given finite set. In sampling, an incidence matrix is $\Omega_N$, which is an ($N \times 2^N$) 
(0, 1)-matrix such that its columns correspond to the elements of $2^U$, and rows to 
the elements of U. In order to simplify the discussion, and without loss of 
generality, we assume that the ith row of $\Omega_N$ corresponds to the element i of U, 
and the jth column of $\Omega_N$ corresponds to the jth element of $2^U$ such that the 
elements of $2^U$ are arranged in the following standard order: 


(i) If $w_1, w_2 \in 2^U$ and $|w_1| < |w_2|$, then $w_1$ precedes $w_2$; 

(ii) If $|w_1| = |w_2|$ but there exists a $k \in U$ such that 
$|\{1,\ldots,k\} \cap w_1| > |\{1,\ldots,k\} \cap w_2|$ and 
$|\{1,\ldots,l\} \cap w_1| = |\{1,\ldots,l\} \cap w_2|$ for 0 < l < k, then $w_1$ precedes $w_2$. 

In this way, the elements of $2^U$ are arranged as 


$\{w(0), w(1), \ldots, w(2^N - 1)\}$ (12) 


and $i \in w(j)$ if and only if the ith coordinate of the jth column of $\Omega_N$ is equal to 
1. For N = 3, $\Omega_N$ is equal to 


$\begin{bmatrix} 0 & 1 & 0 & 0 & 1 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 & 1 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 0 & 1 & 1 & 1 \end{bmatrix}.$ (13) 

Now, given any sampling measure {p(w): $w \in 2^U$}, we can rewrite it as a 
vector, which is called the vector form of the sampling measure: 


$p' = (p(w(0)), p(w(1)), \ldots, p(w(2^N - 1))).$ (14) 


Combining $\Omega_N$ and p', we have a matrix $\Gamma_N(p)$ where 

$\Gamma_N(p) = \begin{bmatrix} \Omega_N \\ p' \end{bmatrix}.$ (15) 


Thus, $\Gamma_N(p)$ presents a sampling measure in a matrix form. Now suppose all the 
p(w(j)) are rational numbers and p(w(j)) = $v_j/v$ such that $v_j$ is a non-negative 
integer and $v = \sum_l v_l$ (where the sum runs over all l); also suppose there is no common 
factor other than 1 among the $v_l$ ($l = 0, 1, \ldots, 2^N - 1$). Suppose 


$\Omega_N = [c_0, c_1, \ldots, c_{2^N - 1}].$ (16) 

Now we introduce another matrix $A_N(p)$ such that 


$A_N(p) = [\, c_0 1_{v_0} \;\vdots\; c_1 1_{v_1} \;\vdots\; \cdots \;\vdots\; c_{2^N - 1} 1_{v_{2^N - 1}} \,],$ (17) 

where $1_k$ is the (1 x k) vector containing 1 everywhere, and where if for any j, we 
have $v_j = 0$, then the columns ($c_j$) do not appear in $A_N(p)$. Now, drawing a 
column from $A_N(p)$ with probability (1/v) is equivalent to drawing a column 
from $\Omega_N$ with probability measure {p(w(j)) = $v_j/v$, j = 0, 1,..., $2^N - 1$}. So, the 
matrix $A_N(p)$ represents the sampling measure in the form of an array and it is 
called a sampling array in Srivastava (1988), wherein the following result is proved. 
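
The construction of a sampling array from a rational sampling measure can be sketched as follows. This is an illustration only: the measure on N = 3 units is made up, the function names are hypothetical, and the common denominator v is taken as the least common multiple of the denominators, which leaves the integers v_j without a common factor. Each incidence column is repeated v_j times, so that drawing one column uniformly reproduces the measure.

```python
from fractions import Fraction
from functools import reduce
from itertools import combinations
from math import gcd


def standard_order(N):
    """All subsets of {1,...,N}: by size first, then lexicographically."""
    return [w for size in range(N + 1)
            for w in combinations(range(1, N + 1), size)]


def sampling_array(N, p):
    """Columns of the sampling array for a rational measure p (sample -> Fraction)."""
    samples = standard_order(N)
    probs = [Fraction(p.get(w, 0)) for w in samples]
    # v: least common multiple of the denominators of the p(w(j)).
    v = reduce(lambda a, b: a * b // gcd(a, b), (q.denominator for q in probs), 1)
    counts = [int(q * v) for q in probs]              # the integers v_j
    columns = []
    for w, v_j in zip(samples, counts):
        column = tuple(1 if i in w else 0 for i in range(1, N + 1))
        columns.extend([column] * v_j)                # column c_j repeated v_j times
    return columns, v


# Hypothetical measure on three samples of size 2.
p = {(1, 2): Fraction(1, 2), (1, 3): Fraction(1, 3), (2, 3): Fraction(1, 6)}
columns, v = sampling_array(3, p)
print(v, columns)
```

Samples with probability zero contribute no columns, in line with the convention stated for (17).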


Theorem 6 


For any vector form of sampling measure p and $\epsilon > 0$, there exists a 
vector form of sampling measure p* whose elements are rational such that 
(p - p*)' (p - p*) < $\epsilon$. (Note that every sampling measure can be expressed in 
the vector form.) 

Although this theorem seems simple it has an important interpretation in 
that we can replace a sampling measure by a rational sampling measure as 
closely as we want. On the other hand, by using a rational sampling measure we 
get a sampling array. So the above theorem connects sampling theory to the 
theory of arrays in a fundamental manner, and hence to factorial and other 
experimental designs. 

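Now consider the problem of estimating the population total Y by a 
general linear estimator $\hat{Y}_G$ (G means general) where 


$\hat{Y}_G = \sum_{i \in w} c_{iw} y_i = \sum_{i=1}^{N} c_{iw} a_{iw} y_i,$ (18) 


and where $c_{iw}$ are known real numbers which depend on i and w for all $i \in U$, 
$w \in 2^U$. Define 


$\phi_{ic} = \sum_{w} c_{iw} a_{iw} p(w),$ (19) 

$\phi_{iic} = \sum_{w} c_{iw}^2 a_{iw} p(w), \qquad \phi_{ijc} = \sum_{w} c_{iw} c_{jw} a_{iw} a_{jw} p(w),$ (20) 

$\phi_c = (\phi_{1c}, \ldots, \phi_{Nc})', \qquad \Phi_c^{*} = (\phi_{ijc})_{N \times N},$ (21) 

$\Phi_c = \Phi_c^{*} - (J_{N1}\phi_c' + \phi_c J_{1N}) + J_{NN},$ (22) 


where $J_{mn}$ is an m x n matrix all of whose elements are equal to 1. It is easy to check 
that 


$\Phi_c^{*} = \sum_{w} p(w)\, U_{cw} U_{cw}',$ (23) 


where 

$U_{cw} = (c_{1w} a_{1w}, c_{2w} a_{2w}, \ldots, c_{Nw} a_{Nw})'.$ (24) 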


We have the following theorem: 


Theorem 7 


The mean square error of $\hat{Y}_G$ as an estimator of Y, denoted by MSE($\hat{Y}_G$), 
is 


MSE($\hat{Y}_G$) = $y'\Phi_c\, y$, where y = $(y_1, \ldots, y_N)'$. (25) 


Notice that the matrix $\Phi_c$ is known when the sampling measure and the 
estimator $\hat{Y}_G$ are selected. The matrix $\Phi_c$ in sampling theory is similar to the 
information matrix in the theory of experimental design. 
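
The quadratic form in Theorem 7 can be checked against a direct computation of the mean square error. The sketch below is built on assumptions made only for illustration: a variable-size measure on N = 3 units, arbitrary coefficients c(i, w), and arbitrary y-values. The matrix is assembled entrywise as the second-moment matrix of the weighted indicators, minus the first-moment term for the row index, minus the first-moment term for the column index, plus one, which is one reading of the definition of the matrix in (22).

```python
from fractions import Fraction as F

N = 3
units = range(1, N + 1)

# Hypothetical sampling measure with variable sample size.
p = {(1, 2): F(1, 2), (1, 3): F(1, 4), (2, 3): F(1, 8), (1, 2, 3): F(1, 8)}


def c(i, w):
    """Known coefficient for unit i in sample w (arbitrary illustrative choice)."""
    return F(N + i, len(w))


def a(i, w):
    """Indicator that unit i belongs to sample w."""
    return 1 if i in w else 0


# First moments of the weighted indicators c(i, w) a(i, w).
phi = [sum(c(i, w) * a(i, w) * pw for w, pw in p.items()) for i in units]
# Second moments (the starred matrix).
phi_star = [[sum(c(i, w) * c(j, w) * a(i, w) * a(j, w) * pw for w, pw in p.items())
             for j in units] for i in units]
# Entry (i, j) of the MSE matrix.
Phi = [[phi_star[i][j] - phi[i] - phi[j] + 1 for j in range(N)] for i in range(N)]


def mse_quadratic_form(y):
    return sum(y[i] * Phi[i][j] * y[j] for i in range(N) for j in range(N))


def mse_direct(y):
    total = sum(y)
    return sum(pw * (sum(c(i, w) * y[i - 1] for i in w) - total) ** 2
               for w, pw in p.items())


y = [2, 5, 11]
print(mse_quadratic_form(y), mse_direct(y))
```

The two values coincide for any y, since the matrix does not involve the y-values at all; this is what allows the design and the coefficients to be assessed before any data are collected.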


A General Estimator 


In this section, we consider an estimator proposed in Srivastava (1985). 
There is an interesting history relevant here. First, in 1985, Srivastava observed 
the connection between combinatorial arrays and sampling theory, discussed in 
the last section. This appeared to open up a quite new theoretical field, in which 
variable sample size appeared to be inherent. Thus, there seemed to be a need 
for a general estimator in which sample size was not necessarily fixed. Now, 
most estimators in sampling theory relate to fixed size. In many ways, the most 
general estimator (which, among other things, allows variable sample size) 
existing in 1985 was the Horvitz-Thompson estimator. But this is entirely 
dependent on the sampling measure, which is of course decided upon before the 
sample is drawn. In an attempt to be able to utilize the new knowledge 
(independent of the sample, but obtained during the course of actual sampling) 
the concepts of the sample weight function (discussed below), and the estimator of 
this section, were discovered. This estimator is extremely general, in that most 
of the known estimators turn out to be its special cases. 

The most important concept in this estimator is the introduction of the
sample weight function $r$, defined on $2^U$ such that for all $\omega \in 2^U$, $r(\omega)$ is a finite
real number. For every $K \subseteq U$ and $k \in \{1, 2, \ldots, N\}$, let


$(K: k) = \{\omega : \omega \subseteq K,\ |\omega| = k\}$.


Clearly, if $i \in (U: k)$, then $i$ is a k-tuple with $k$ distinct elements from $U$. From
here on, $\sum_{i}$ will denote the sum over all $i \in (U: k)$, $\sum_{\omega \supseteq i}$ will denote the sum
over all $\omega \in 2^U$ such that $i \subseteq \omega$, and $\sum_{i \subseteq \omega}$ will denote the sum over all $i \in (\omega: k)$.
Note that the last sum could be empty. In this section, we always look
upon $i = (i_1, \ldots, i_k)$ as an unordered set $\{i_1, \ldots, i_k\}$. Let


$\pi_r(i) = \sum_{\omega \supseteq i} p(\omega)\, r(\omega)$. (26)


For $k \in \{1, 2, \ldots, N-1\}$, $t \in \{1, 2, \ldots, N\}$, and $i \in (U: k)$, let $T_r(i, t)$ be
the class of all unordered sets $j = (j_1, \ldots, j_t)$ such that $j \in (U: t)$, $i \subseteq j$
where $i = (i_1, \ldots, i_k)$, and $\pi_r(j) \neq 0$. Let

$\nu_r(i, t) = |T_r(i, t)|$. (27)


Also, let $a(i, t)$ be real numbers which satisfy the following two
conditions:


$a(i, t) = 0$ if $\nu_r(i, t) = 0$, and (28)

$\sum_{t=1}^{N} a(i, t) = 1$. (29)


For all $\omega \in 2^U$ and $i \in (U: k)$, define


$\beta_r(i, \omega) = r(\omega) \sum_{t=1}^{N} a(i, t)\, [\nu_r(i, t)]^{-} \sum^{*} [\pi_r(j)]^{-}$, (30)


where $a^{-} = a^{-1}$ if $a \neq 0$ and $a^{-} = 0$ if $a = 0$, and $\sum^{*}$ runs over all $j \in (U: t)$
such that $i = (i_1, \ldots, i_k) \subseteq j$ and $j \subseteq \omega$. Now, consider the estimation of the
following symmetric linear population function $Q(\psi)$, where


$Q(\psi) = \sum_{i} \psi(i)$, (31)


where $\psi$, defined over $(U: k)$, is such that for all $i \in (U: k)$, $\psi(i)$ is a real
number. Notice that when $i \subseteq \omega$ and $\omega$ is selected, $\psi(i)$ can be calculated. Thus,
once a sample $\omega$ is drawn, we can compute $\hat{Q}_{sr}(\psi)$, where


$\hat{Q}_{sr}(\psi) = \sum_{i \subseteq \omega} \psi(i)\, \beta_r(i, \omega)$. (32)


Here in $\hat{Q}_{sr}$, $s$ means that we are estimating a symmetric function, and $r$ means
that the sample weight function $r$ is being used.


Theorem 8 


The statistic $\hat{Q}_{sr}(\psi)$ is an unbiased estimator of $Q(\psi)$ if and only if, for
every $i \in (U: k)$ with $\psi(i) \neq 0$, there exists a $t$ such that $1 \le t \le N$ and
$\nu_r(i, t) \neq 0$.

For the case of $\pi_r(i) \neq 0$ for all $i \in (U: k)$, let $a(i, k) = 1$ and $a(i, t) = 0$
for all $t \neq k$. Then we have


$\beta_r(i, \omega) = r(\omega) / \pi_r(i)$ for $i \subseteq \omega$, and (33)

$\hat{Q}_{sr}(\psi) = r(\omega) \sum_{i \subseteq \omega} \psi(i) / \pi_r(i)$. (34)


By using Theorem 8, it can be checked that (34) is unbiased for $Q(\psi)$.
The variance of $\hat{Q}_{sr}(\psi)$ and an unbiased estimator of the variance of
$\hat{Q}_{sr}(\psi)$ were also obtained in Srivastava (1985). Now we turn to an estimator of
the population total.

Let $k = 1$ and $\psi(i) = y_i$. Then $Q(\psi) = \sum_{i=1}^{N} y_i = Y$, and (34) gives


$\hat{Q}_{sr}(\psi) = r(\omega) \sum_{i \in \omega} y_i / \pi_r(i) = \hat{Y}_{sr1}$, say. (35)


By Theorem 8, if $\pi_r(i) \neq 0$ for all $i \in U$, then $\hat{Y}_{sr1}$ is an unbiased estimator of
$Y$. When


$r(\omega) = 1$ for all $\omega \in 2^U$, (36)


the $\pi_r(i)$'s become $\pi_i$'s, where $\pi_i$ is the probability that unit $i$ is included in
the sample. In this case, $\hat{Y}_{sr1}$ becomes the well-known Horvitz-Thompson
estimator $\hat{Y}_{HT}$, where


$\hat{Y}_{HT} = \sum_{i \in \omega} y_i / \pi_i$. (37)
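
The unbiasedness of the estimator with a sample weight function can be checked by brute-force enumeration on a small population. The sketch below is only an illustration under assumed inputs: the population values, the variable-size design (probabilities proportional to sample size), and the second weight function $r(\omega) = 1/|\omega|$ are hypothetical. With $r \equiv 1$ the estimator is the Horvitz-Thompson estimator.

```python
from fractions import Fraction as F
from itertools import combinations

N = 4
units = range(N)
y = [F(3), F(1), F(4), F(2)]          # hypothetical population values

# Hypothetical variable-size design: all nonempty samples, p(w) proportional to |w|.
samples = [frozenset(w) for n in range(1, N + 1) for w in combinations(units, n)]
total = sum(len(w) for w in samples)
p = {w: F(len(w), total) for w in samples}

def pi_r(i, r):
    """Sum of p(w) * r(w) over the samples that contain unit i."""
    return sum(p[w] * r(w) for w in samples if i in w)

def y_sr1(w, r):
    """Weighted estimator: r(w) * sum over i in w of y_i / pi_r(i)."""
    return r(w) * sum(y[i] / pi_r(i, r) for i in w)

def expectation(r):
    """Expected value of the estimator over the whole design."""
    return sum(p[w] * y_sr1(w, r) for w in samples)

r_const = lambda w: F(1)              # r = 1: Horvitz-Thompson case
r_inv = lambda w: F(1, len(w))        # another illustrative weight function

print(expectation(r_const), expectation(r_inv), sum(y))
```

Both expectations equal the population total because every unit has a nonzero weighted inclusion sum under this design.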


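The variance of $\hat{Y}_{sr1}$ is given in the following theorem.


Theorem 9

Suppose $\pi_r(i) \neq 0$, $i = 1, \ldots, N$. Then


$\mathrm{Var}(\hat{Y}_{sr1}) = \sum_{i=1}^{N} y_i^2 \left[ \frac{\pi_{r^2}(i)}{[\pi_r(i)]^2} - 1 \right] + \sum_{i \neq j} y_i y_j \left[ \frac{\pi_{r^2}(ij)}{\pi_r(i)\, \pi_r(j)} - 1 \right]$, (38)


where

$\pi_{r^2}(i) = \sum_{\omega \ni i} p(\omega) [r(\omega)]^2$, $i = 1, \ldots, N$, (39)

$\pi_{r^2}(ij) = \sum_{\omega \ni i, j} p(\omega) [r(\omega)]^2$, $i \neq j$, $i, j = 1, \ldots, N$. (40)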
It is easy to see that


$\pi_{r^2}(i) / [\pi_r(i)]^2 \ge 1 / \pi_i$, $i = 1, \ldots, N$. (41)


So the term in $y_i^2$ in $\mathrm{Var}(\hat{Y}_{sr1})$ is never smaller than the corresponding term for
$\hat{Y}_{HT}$. But we can choose $r(\omega)$ such that the cross product terms of $\hat{Y}_{sr1}$ are small,
so that $\mathrm{Var}(\hat{Y}_{sr1})$ is small. Examples are given in Srivastava (1985).
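
The variance expression of Theorem 9 and the inequality (41) can be verified by enumeration. In the sketch below the design (uniform over all nonempty samples of 4 units), the weight function $r(\omega) = 1/|\omega|$, and the population values are hypothetical; the helper `pi(power, ...)` is an illustrative name for the sums of $p(\omega)[r(\omega)]^{power}$ over samples containing the given units.

```python
from fractions import Fraction as F
from itertools import combinations

N = 4
units = range(N)
y = [F(3), F(1), F(4), F(2)]                    # hypothetical population values
Y = sum(y)
samples = [frozenset(w) for n in range(1, N + 1) for w in combinations(units, n)]
p = {w: F(1, len(samples)) for w in samples}    # uniform over the 15 nonempty samples
r = lambda w: F(1, len(w))                      # illustrative weight function

def pi(power, *ids):
    """Sum of p(w) * r(w)**power over samples containing every unit in ids."""
    return sum(p[w] * r(w) ** power for w in samples if set(ids) <= w)

def estimate(w):
    """r(w) * sum over i in w of y_i / pi_r(i)."""
    return r(w) * sum(y[i] / pi(1, i) for i in w)

# Variance by direct enumeration (the estimator is unbiased, so MSE = variance).
direct_var = sum(p[w] * (estimate(w) - Y) ** 2 for w in samples)

# Variance from the squared terms plus the cross product terms.
formula_var = sum(y[i] ** 2 * (pi(2, i) / pi(1, i) ** 2 - 1) for i in units) + sum(
    y[i] * y[j] * (pi(2, i, j) / (pi(1, i) * pi(1, j)) - 1)
    for i in units for j in units if i != j)

# Inequality (41): power 0 gives the ordinary inclusion probability pi_i.
ineq_holds = all(pi(2, i) / pi(1, i) ** 2 >= 1 / pi(0, i) for i in units)
print(direct_var == formula_var, ineq_holds)
```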


Balanced Array Sampling 


We have defined arrays in the fourth section. Let $K$ $(a \times b)$ and
$k$ $(a \times 1)$ be a matrix and a vector with elements from $\sigma_s$, where $\sigma_s$ is a finite set
whose elements are $\{0, 1, \ldots, s-1\}$. The symbol $\lambda(\cdot, \cdot)$ is defined as a counting
operator, such that $\lambda(k, K)$ is equal to the number of times $k$ occurs as a column
of $K$. Let $\Psi_s$ be the permutation group over $\sigma_s$. For $\psi \in \Psi_s$ and $j \in \sigma_s$, let
$\psi(j)$ be the image of $j$ when the permutation $\psi$ is applied. Similarly, we define
$\psi(k) = (\psi(k_1), \ldots, \psi(k_a))'$ if $k = (k_1, \ldots, k_a)'$ is an $(a \times 1)$ array over $\sigma_s$.


Definition 1 


Let $K$ be an $(a \times b)$ array over $\sigma_s$. Then $K$ is a balanced array (B-array,
or BA) of strength $t$ if and only if


$\lambda(k_0, K_0) = \lambda(\psi(k_0), K_0)$, (42)


where $k_0$ is any $(t \times 1)$ array over $\sigma_s$, $K_0$ is any $(t \times b)$ subarray of $K$, and $\psi$ is
any permutation in $\Psi_s$.

Balanced arrays play an important role in factorial experimental design
and coding theory. For $i = (i_1, \ldots, i_k) \in (U: k)$, define


$\pi(i_1, \ldots, i_k) = \pi(i) = \sum_{\omega \supseteq i} p(\omega)$. (43)

When $k = 1$ or 2, the following customary notations will be used instead of $\pi(i)$,
$\pi(ij)$:

$\pi_i = \pi(i)$, $\pi_{ij} = \pi(ij)$. (44)


Definition 2 


Let $p(\cdot) = \{p(\omega): \omega \in 2^U\}$ be a sampling measure. Then $p(\cdot)$
corresponds to balanced array sampling with strength $k$ iff $\pi(i_1, \ldots, i_g)$ is fixed for
all possible $(i_1, \ldots, i_g) \in (U: g)$. Here, $g = 0, 1, \ldots, k$.

Thus, if $p(\cdot)$ corresponds to balanced array sampling with strength $k$,
then there exist $\theta_0, \ldots, \theta_k$ such that

$\pi(i_1, \ldots, i_g) = \theta_g$, (45)

for $(i_1, \ldots, i_g) \in (U: g)$ and $g = 0, 1, \ldots, k$.


Theorem 10 


Suppose the measure $p(\cdot)$ corresponds to BA sampling with strength $k$.
Then there exists a sampling measure $p^*(\cdot)$ whose sampling array is $A_U(p^*)$ such
that $A_U(p^*)$ is a B-array of strength $k$, and $p^*(\cdot)$ is arbitrarily close to $p(\cdot)$ (in
the sense of Theorem 6).


Theorem 11 


Suppose $A_U(p)$ is an $(N \times v)$ B-array of strength $k$. Let $p(\cdot)$ be a sampling
measure that gives probability $(1/v)$ to each column of $A_U(p)$ for being
selected. Then $p(\cdot)$ corresponds to balanced array sampling with strength $k$.
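
For a 0/1 sampling array whose columns are drawn with equal probability, the condition of Definition 2 can be checked directly by counting. The sketch below is an illustration only: it tests whether the joint inclusion probabilities are constant for each order $g$, and does not test Definition 1 itself. The function names and the two small arrays are hypothetical.

```python
from fractions import Fraction as F
from itertools import combinations

def inclusion_probabilities(array, g):
    """Joint inclusion probability of each g-subset of rows when every column has probability 1/v."""
    n_rows, v = len(array), len(array[0])
    return {ids: F(sum(all(array[i][c] == 1 for i in ids) for c in range(v)), v)
            for ids in combinations(range(n_rows), g)}

def ba_sampling_strength(array):
    """Largest k such that the order-g inclusion probabilities are constant for g = 1, ..., k."""
    k = 0
    for g in range(1, len(array) + 1):
        if len(set(inclusion_probabilities(array, g).values())) > 1:
            break
        k = g
    return k

# Hypothetical 4 x 4 array: rows are units, columns are the samples {1,2}, {3,4}, {1,3}, {2,4}.
A = [[1, 0, 1, 0],
     [1, 0, 0, 1],
     [0, 1, 1, 0],
     [0, 1, 0, 1]]

# Adding the columns {1,4} and {2,3} gives all six pairs, i.e. SRSWOR of size 2.
B = [[1, 0, 1, 0, 1, 0],
     [1, 0, 0, 1, 0, 1],
     [0, 1, 1, 0, 0, 1],
     [0, 1, 0, 1, 1, 0]]

print(ba_sampling_strength(A), ba_sampling_strength(B))
```

In array `A` every unit has inclusion probability 1/2 but the pair {1,4} never occurs, so only the first-order condition holds; in array `B` every order is constant (orders 3 and 4 trivially, with probability 0).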

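Let $\delta_1$ and $\delta_2$ be the mean and the variance of the sample size under the
measure $p(\cdot)$, i.e.,


$\delta_1 = \sum_{\omega} p(\omega)\, |\omega|$, (46)
$\delta_2 = \sum_{\omega} p(\omega)\, (|\omega| - \delta_1)^2$. (47)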


Then we have the following theorem. 


Theorem 12 
Consider BA sampling whose inclusion probabilities are given by (45).
Then

$\hat{Y}_{HT} = \theta_1^{-1} \sum_{i \in \omega} y_i$, (48)

$V(\hat{Y}_{HT}) = N^2 S^2 \left( \frac{1}{\delta_1} - \frac{1}{N} \right) + \frac{\delta_2}{\delta_1^2} \left[ N^2 \bar{Y}^2 - N S^2 \right]$, (49)

where $\bar{Y} = N^{-1} \sum_{i=1}^{N} y_i$ and $S^2 = (N-1)^{-1} \sum_{i=1}^{N} (y_i - \bar{Y})^2$ are respectively the population
mean and variance. (This result is useful because, if we have
some idea of the value of $\bar{Y}$, we can reduce the variance below that of SRSWOR.
This may happen, for example, in recursive sampling.)
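
The variance expression in terms of the mean and variance of the sample size can be checked numerically. The sketch below assumes a hypothetical design that first picks a sample size (2 or 3, each with probability 1/2) and then draws an SRSWOR of that size from 5 units; by symmetry all joint inclusion probabilities of a given order are equal, so this is BA sampling. The population values are made up.

```python
from fractions import Fraction as F
from itertools import combinations
from math import comb

N = 5
y = [F(v) for v in (3, 1, 4, 1, 5)]       # hypothetical population values
Y = sum(y)
ybar = Y / N
S2 = sum((v - ybar) ** 2 for v in y) / (N - 1)

q = {2: F(1, 2), 3: F(1, 2)}              # hypothetical sample-size distribution
p = {frozenset(w): q[n] / comb(N, n) for n in q for w in combinations(range(N), n)}

delta1 = sum(pw * len(w) for w, pw in p.items())                   # mean sample size
delta2 = sum(pw * (len(w) - delta1) ** 2 for w, pw in p.items())   # variance of sample size
theta1 = delta1 / N                                                # common inclusion probability

def ht(w):
    """Horvitz-Thompson estimate when every unit has inclusion probability theta1."""
    return sum(y[i] for i in w) / theta1

direct_var = sum(pw * (ht(w) - Y) ** 2 for w, pw in p.items())
formula_var = (N**2 * S2 * (1 / delta1 - F(1, N))
               + delta2 / delta1**2 * (N**2 * ybar**2 - N * S2))
print(direct_var, formula_var)
```

When the sample size is fixed the second term vanishes and the expression reduces to the SRSWOR variance.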
Definition 3


Let $p(\cdot)$ be a sampling measure. Then $p(\cdot)$ corresponds to proportional
array sampling with strength $k$ (or, briefly, proportional sampling) iff for all
integers $g$ such that $1 \le g \le k$, and all $(i_1, \ldots, i_g) \in (U: g)$, we have


$\pi(i_1, \ldots, i_g) = \pi(i_1) \cdots \pi(i_g)$. (50)


Notice that when $\pi_i$ is fixed, say $\theta$, for all $i \in U$, then proportional
array sampling with strength $k$ is also balanced array sampling with strength $k$. In
this case we call it balanced proportional sampling with strength $k$.

In order to construct a $p(\cdot)$ which corresponds to proportional array
sampling with strength $k$, we need the definition of an orthogonal array (OA).


Definition 4 


Let $K$ be an $(a \times b)$ array over $\sigma_s$. Then $K$ is an orthogonal array of
strength $t$ if and only if


$\lambda(k_0, K_0) = b \times s^{-t}$, (51)


where $k_0$ is any $(t \times 1)$ array over $\sigma_s$ and $K_0$ is any $(t \times b)$ subarray of $K$. It is
easy to see that an OA with strength $t$ is a BA with strength $t$.

Let $L$ $(N \times b) = (\ell_1, \ldots, \ell_N)'$ be an OA of strength $k$ over $\sigma_s$, where $s$ is a
prime number. Let $s_i$ be an integer satisfying $1 \le s_i \le s$, $i = 1, \ldots, N$. In $\ell_i$,
replace the $(s_i - 1)$ symbols $\{2, 3, \ldots, s_i\}$ by 1, leave the original 1 unchanged, and
replace the other symbols (if any) by 0. Notice that when $s_i = s$, the symbol $s_i$ is
the same as the symbol 0. Let $L^*$ $(N \times b)$ be the array obtained by the above
replacement.


Theorem 13


Consider a sampling measure $p(\cdot)$ which has $L^*$ $(N \times b)$ as a
sampling array. Then $p(\cdot)$ corresponds to proportional sampling of strength $k$,
such that the inclusion probability of unit $i$ is equal to $s_i/s$, for $i = 1, \ldots, N$.
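
The construction can be tried on a small case. The sketch below is an illustration under assumed inputs: it uses a standard 4 x 9 orthogonal array of strength 2 over {0, 1, 2} (rows $x_1$, $x_2$, $x_1 + x_2$, $x_1 + 2x_2$ modulo 3), a hypothetical choice of the integers $s_i$, and equal probability 1/b for each column of the collapsed array. The names `collapse`, `L_star`, and `pi` are illustrative.

```python
from fractions import Fraction as F
from itertools import combinations, product

s, k = 3, 2
# Columns are all pairs (x1, x2) over {0, 1, 2}; the four rows are linear forms mod 3.
columns = list(product(range(s), repeat=2))
L = [[x1 for x1, x2 in columns],
     [x2 for x1, x2 in columns],
     [(x1 + x2) % s for x1, x2 in columns],
     [(x1 + 2 * x2) % s for x1, x2 in columns]]
N, b = len(L), len(columns)

def is_orthogonal_array(K, t, s):
    """Check that every t-vector occurs b / s**t times in every t-rowed subarray."""
    b = len(K[0])
    for rows in combinations(range(len(K)), t):
        counts = {}
        for c in range(b):
            key = tuple(K[i][c] for i in rows)
            counts[key] = counts.get(key, 0) + 1
        if len(counts) != s ** t or any(n * s ** t != b for n in counts.values()):
            return False
    return True

s_i = [1, 2, 3, 1]   # hypothetical choice with 1 <= s_i <= s

def collapse(symbol, si):
    """Symbols 1..s_i become 1 (symbol s is identified with 0); all other symbols become 0."""
    label = symbol if symbol != 0 else s
    return 1 if label <= si else 0

L_star = [[collapse(L[i][c], s_i[i]) for c in range(b)] for i in range(N)]

def pi(*ids):
    """Joint inclusion probability when each column of L_star has probability 1/b."""
    return F(sum(all(L_star[i][c] for i in ids) for c in range(b)), b)

print([pi(i) for i in range(N)])
```

For this example each first-order inclusion probability equals $s_i/s$ and each pairwise probability factorizes, as the theorem states for strength 2.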


Theorem 14 
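We have

$\mathrm{var}(\hat{Y}_{HT}) = \sum_{i=1}^{N} y_i^2 \left( \frac{1}{\pi_i} - 1 \right)$ (52)

for proportional sampling and

$\mathrm{var}(\hat{Y}_{HT}) = \left( \frac{1}{\theta} - 1 \right) \{ (N - 1) S^2 + N \bar{Y}^2 \}$ (53)


for balanced proportional sampling.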
We can use BA sampling with strength 4 to imitate SRSWOR up to the
4th moments. Notice that the binomial sampling referred to in the literature is a
balanced proportional sampling with strength $N$. It is clear that it should be
adequate to use balanced proportional sampling with strength 4 instead of
using binomial sampling.


Weight Balanced Sampling 


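Now we introduce an estimator of $Y$ called $\hat{Y}_{sr2}$, which is a special case of
$\hat{Y}_{sr1}$ when


$r(\omega) = |\omega|^{-1}$ for all $\omega \in 2^U$, $\omega \neq \emptyset$. (54)

Let

$\pi'_i = \sum_{\omega \ni i} \frac{p(\omega)}{|\omega|}$, $i = 1, \ldots, N$, (55)

$\pi''_i = \sum_{\omega \ni i} \frac{p(\omega)}{|\omega|^2}$, $i = 1, \ldots, N$, (56)

$\pi''_{ij} = \sum_{\omega \ni i, j} \frac{p(\omega)}{|\omega|^2}$, $i \neq j$, $i, j = 1, \ldots, N$, (57)


where we assume that empty samples are not allowed.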


Theorem 15 
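Suppose $\pi'_i > 0$ for $i = 1, \ldots, N$. Then


$\hat{Y}_{sr2} = |\omega|^{-1} \sum_{i \in \omega} y_i / \pi'_i$, (58)

$E(\hat{Y}_{sr2}) = Y$, (59)

$\mathrm{var}(\hat{Y}_{sr2}) = \sum_{i=1}^{N} y_i^2 \left[ \frac{\pi''_i}{(\pi'_i)^2} - 1 \right] + \sum_{i \neq j} y_i y_j \left[ \frac{\pi''_{ij}}{\pi'_i \pi'_j} - 1 \right]$. (60)


Notice that when the sample size is fixed, $\hat{Y}_{sr2} = \hat{Y}_{HT}$.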


Definition 5 


A sampling measure $p(\cdot)$ corresponds to weight-balanced (WB) sampling
if and only if $\pi''_i / (\pi'_i)^2$ and $\pi''_{ij} / (\pi'_i \pi'_j)$ are constants for $i \in U$ and for
$i \neq j$, $i, j \in U$, respectively.

Let

$\pi''_i / (\pi'_i)^2 = \beta_1$, for all $i$, (61)

$\pi''_{ij} / (\pi'_i \pi'_j) = \beta_2$, for all $i \neq j$, $i, j \in U$. (62)

We have the following corollary of Theorem 15.


Corollary 1

Under WB sampling, we have


$V(\hat{Y}_{sr2}) = (N - 1) S^2 (\beta_1 - \beta_2) + N \bar{Y}^2 [(\beta_1 - \beta_2) + N(\beta_2 - 1)]$. (63)

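Definition 6


A sampling measure $p(\cdot)$ corresponds to strongly weight-balanced (SWB)
sampling if and only if $\pi'_i$, $\pi''_i$, and $\pi''_{ij}$ are constants for $i \in U$ and for
$i \neq j$, $i, j \in U$, respectively.


Let

$\pi''_i = \beta_3$, $i \in U$, and (64)

$\beta_0 = N \beta_3 = \sum_{j=1}^{N} \pi''_j$. (65)


Theorem 16


For SWB sampling, we have


$\mathrm{var}(\hat{Y}_{sr2}) = N^2 S^2 \left( \beta_0 - \frac{1}{N} \right)$. (66)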


Suppose $q(n) \ge 0$, $n = 1, \ldots, N$, and $\sum_{n=1}^{N} q(n) = 1$. Suppose we draw a
sample in this way: first select the sample size $n$ with probability $q(n)$, then
use SRSWOR to draw a sample of size $n$. Then use $N \bar{y}_\omega$ to estimate the
population total $Y$. In this way, we select a particular sample of size $n$ with
probability $q(n) \binom{N}{n}^{-1}$. We have


$\mathrm{var}(N \bar{y}_\omega) = \sum_{n=1}^{N} q(n)\, E[(N \bar{y}_\omega - Y)^2 \mid |\omega| = n]$
$= \sum_{n=1}^{N} q(n)\, N^2 S^2 \left( \frac{1}{n} - \frac{1}{N} \right)$
$= N^2 S^2 \left( \beta_0 - \frac{1}{N} \right)$, (67)

where $\beta_0 = \sum_{n=1}^{N} q(n)/n$ for this design.
Hence, the technique of using $\hat{Y}_{sr2}$ to estimate the population total in
SWB sampling is a technique which imitates SRSWOR.
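
The two-stage scheme just described (random sample size, then SRSWOR) can be enumerated exactly for a small population. The sketch below is an illustration with a hypothetical size distribution and made-up population values. It checks that the weighted inclusion sums are constant, that the estimator with weight $1/|\omega|$ is unbiased, and that its variance equals $N^2 S^2 (\beta_0 - 1/N)$ with $\beta_0$ the expected reciprocal sample size.

```python
from fractions import Fraction as F
from itertools import combinations
from math import comb

N = 5
units = range(N)
y = [F(v) for v in (3, 1, 4, 1, 5)]          # hypothetical population values
Y = sum(y)
ybar = Y / N
S2 = sum((v - ybar) ** 2 for v in y) / (N - 1)

q = {1: F(1, 4), 2: F(1, 4), 4: F(1, 2)}     # hypothetical sample-size distribution
p = {frozenset(w): q[n] / comb(N, n) for n in q for w in combinations(units, n)}

def pi(power, *ids):
    """Sum of p(w) / |w|**power over samples containing ids (power 1 or 2)."""
    return sum(pw / len(w) ** power for w, pw in p.items() if set(ids) <= w)

pi1 = [pi(1, i) for i in units]              # first-order weighted sums, one per unit

def y_sr2(w):
    """(1/|w|) * sum over i in w of y_i / pi'_i."""
    return sum(y[i] / pi1[i] for i in w) / len(w)

mean = sum(pw * y_sr2(w) for w, pw in p.items())
var = sum(pw * (y_sr2(w) - Y) ** 2 for w, pw in p.items())
beta0 = sum(pw / len(w) for w, pw in p.items())   # expected value of 1/|w|
print(mean, var, N**2 * S2 * (beta0 - F(1, N)))
```

Because every first-order weighted sum equals 1/N here, the estimate for a sample is simply N times its sample mean.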
An estimator $\hat{Y}_G$ is said to be location invariant if and only if


$\hat{Y}_G$ (given that $y = y^*$) $= -y_0 N + \hat{Y}_G$ (given that $y = y^* + y_0 J_{N1}$) (68)


for all real $y_0$, where $y = (y_1, \ldots, y_N)'$. It is easy to see that $\hat{Y}_G = \sum_{i=1}^{N} c_{i\omega} y_i$ is
location invariant iff $\sum_{i=1}^{N} c_{i\omega} = N$ for all $\omega \in 2^U$.


Theorem 17


Under SWB sampling, $\hat{Y}_{sr2}$ is location invariant.
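
Location invariance can be checked sample by sample. The sketch below is an illustration with a hypothetical design (sample size 1 with probability 1/3 or 3 with probability 2/3, then SRSWOR from 4 units) and made-up values; it verifies that the coefficients of the estimator sum to N in every sample and that adding a constant $y_0$ to every unit raises the estimate by exactly $N y_0$.

```python
from fractions import Fraction as F
from itertools import combinations
from math import comb

N = 4
units = range(N)
q = {1: F(1, 3), 3: F(2, 3)}                       # hypothetical size distribution
p = {frozenset(w): q[n] / comb(N, n) for n in q for w in combinations(units, n)}

# Sum of p(w) / |w| over samples containing unit i; equals 1/N for this design.
pi1 = [sum(pw / len(w) for w, pw in p.items() if i in w) for i in units]

def coefficient_sum(w):
    """Sum over i in w of the coefficient 1 / (|w| * pi'_i) attached to y_i."""
    return sum(1 / (len(w) * pi1[i]) for i in w)

def y_sr2(w, y):
    """(1/|w|) * sum over i in w of y_i / pi'_i."""
    return sum(y[i] / pi1[i] for i in w) / len(w)

y_star = [F(2), F(7), F(1), F(8)]                  # hypothetical population values
y0 = F(10)                                         # common shift
shifted = [v + y0 for v in y_star]

sums_equal_N = all(coefficient_sum(w) == N for w in p)
invariant = all(y_sr2(w, y_star) == -y0 * N + y_sr2(w, shifted) for w in p)
print(sums_equal_N, invariant)
```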

The material in this section comes from Srivastava (1987), where
examples of WB sampling are given. From an unpublished paper of Srivastava and
Ouyang (1988), we know that $\hat{Y}_{sr2}$ is an admissible linear estimator of $Y$, and has
a variance formula which is similar to the Yates-Grundy variance formula for
$\mathrm{Var}(\hat{Y}_{HT})$ when the sample size is fixed.


An Example of Controlled Sampling and BA Sampling 


Now we discuss an example given by Avadhani and Sukhatme (1973) in 
controlled sampling. Let N = 7, and suppose these seven units are located as in 
the diagram below: 


Here, any two units which are connected by a line are considered neighbors.
We are going to draw a sample of size 3 from these 7 units. In order to reduce the
travel cost, we hope the sample consists of neighboring units. So a sample
$\omega = \{i_1, i_2, i_3\}$ is considered preferred if and only if, after a suitable permutation,
there is a line between $i_1$ and $i_2$ and also a line between $i_2$ and $i_3$. So the
total number of preferred samples is 21, and the total number of possible samples
is $\binom{7}{3} = 35$.

Consider a BIBD with parameters $N = 7$, $k = 3$, $b = 7$, $r = 3$, $\lambda = 1$:


1 1 
2 5 (69) 
3 6 


Now only the block corresponding to column 7 is not preferred. Hence if we use
probability 1/7 to draw a column from $T$, we greatly reduce the probability of drawing a
nonpreferred sample, and at the same time we have the same first two
moments as SRSWOR. But this technique does not avoid the nonpreferred
samples totally. To avoid the nonpreferred samples totally, consider a balanced
array approach as follows. We have a list of 16 samples: {147}, {246}, {543};
{125}, {257}, {576}, {763}, {631}, {321}, {15}, {27}, {56}, {73}, {61}, {32};
{4}. With probability (2/19) we draw any one of the first three samples, and
with probability (1/19) we draw any one of the remaining samples. These
probabilities sum to one, every unit has inclusion probability 7/19, and every
pair of units has joint inclusion probability 2/19. Hence we
avoid the nonpreferred samples. But we use some subsamples of the preferred
samples.
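
The balance claims in this example can be checked by counting. The sketch below rests on stated assumptions: the BIBD used is the standard (7, 3, 1) design generated cyclically from {1, 2, 4}, which is not necessarily the array $T$ of (69); and the 16-sample list is given weights 2/19 (first three samples) and 1/19 (the rest) with the singleton {4}, the values for which the sixteen probabilities sum to one and the first- and second-order inclusion probabilities come out constant. The helper name `inclusion` is illustrative.

```python
from fractions import Fraction as F
from itertools import combinations

def inclusion(design, g, units):
    """Joint inclusion probability of every g-subset of units; design maps frozenset -> probability."""
    return {ids: sum(pw for w, pw in design.items() if set(ids) <= w)
            for ids in combinations(units, g)}

units = range(1, 8)

# SRSWOR of size 3 from 7 units, for reference.
srswor = {frozenset(w): F(1, 35) for w in combinations(units, 3)}

# A standard (7, 3, 1) design; assumed here, NOT necessarily the array T of the text.
blocks = [(1, 2, 4), (2, 3, 5), (3, 4, 6), (4, 5, 7), (5, 6, 1), (6, 7, 2), (7, 1, 3)]
bibd = {frozenset(blk): F(1, 7) for blk in blocks}

# The 16-sample list with the assumed weights 2/19 and 1/19 and assumed singleton {4}.
heavy = ["147", "246", "543"]
light = ["125", "257", "576", "763", "631", "321",
         "15", "27", "56", "73", "61", "32", "4"]
ba = {frozenset(int(ch) for ch in s): F(2, 19) for s in heavy}
ba.update({frozenset(int(ch) for ch in s): F(1, 19) for s in light})

print(sum(ba.values()))
print(set(inclusion(ba, 1, units).values()), set(inclusion(ba, 2, units).values()))
print(inclusion(bibd, 2, units) == inclusion(srswor, 2, units))
```

The last comparison confirms that drawing a block with probability 1/7 reproduces the first- and second-order inclusion probabilities of SRSWOR of size 3.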

The problem of controlled sampling may be approached through the 
concepts of array sampling as follows. 


(i) Decide the preferred and nonpreferred samples.

(ii) Decide whether fixed sample size should be used or not.

(iii) Consider using BA sampling or WB sampling.

(iv) Suppose BA sampling is used. Then we need to find a BA whose
columns consist of the preferred samples. If we fail to get such a BA,
then consider subsamples of these samples. Sometimes we have to
change the decision in step (ii) to consider using some nonpreferred
samples in this step (with minimal probability).